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QUANTIZATION  EFFECTS  IN  DIGITAL  FILTERS 


ABSTRACT 

When  a  digital  filter  is  implemented  on  a  computer  or  with  special-purpose  hardware,  errors 
and  constraints  due  to  finite  word  length  are  unavoidable.  These  quantization  effects  must  be 
considered,  both  in  deciding  what  register  length  is  needed  for  a  given  filter  implementation 
and  in  choosing  between  several  possible  implementations  of  the  same  filter  design,  which 
will  be  affected  differently  by  quantization. 

Quantization  effects  in  digital  filters  can  be  divided  into  four  main  categories:  quantization  of 
system  coefficients,  errors  due  to  analog-digital  (A-D)  conversion,  errors  due  to  roundoffs  in 
the  arithmetic,  and  a  constraint  on  signal  level  due  to  the  requirement  that  overflow  be  pre¬ 
vented  in  the  computation. 

The  effects  of  these  errors  and  constraints  will  vary,  depending  on  the  type  of  arithmetic  used. 
Fixed  point,  floating  point,  and  block  floating  point  are  three  alternate  types  of  arithmetic 
often  employed  in  digital  filtering. 

A  very  large  portion  of  the  computation  performed  in  digital  filtering  is  composed  of  two  basic 
algorithms  -  the  first-  or  second-order,  linear,  constant  coefficient,  recursive  difference 
equation;  and  computation  of  the  discrete  Fourier  transform  (DFT)  by  means  of  the  fast 
Fourier  transform  (FFT).  These  algorithms  serve  as  building  blocks  from  which  the  most 
complicated  digital  filtering  systems  can  be  constructed. 

The  effects  of  quantization  on  implementations  of  these  basic  algorithms  are  studied  in  some 
detail.  Sensitivity  formulas  are  presented  for  the  effects  of  coefficient  quantization  on  the 
poles  of  simple  recursive  filters.  The  mean-squared  error  in  a  computed  DFT,  due  to  coef¬ 
ficient  quantization  in  the  FFT,  is  estimated.  For  both  recursions  and  the  FFT,  the  differing 
effects  of  fixed  and  floating  point  coefficients  are  investigated.  Statistical  models  for  roundoff 
errors  and  A-D  conversion  errors,  and  linear  system  noise  theory,  are  employed  to  estimate 
output  noise  variance  in  simple  recursive  filters  and  in  the  FFT.  By  considering  the  overflow 
constraint  in  conjunction  with  these  noise  analyses,  output  noise-to-signal  ratios  are  derived. 
Noise-to-signal  ratio  analyses  are  carried  out  for  fixed,  floating,  and  block  floatingpoint 
arithmetic,  and  the  results  are  compared. 


*  This  report  is  based  on  a  thesis  of  the  same  title  submitted  to  the  Department  of  Electrical 
Engineering  at  the  Massachusetts  Institute  of  Technology  on  31  July  1969  in  partial  fulfillment 
of  the  requirements  for  the  degree  of  Doctor  of  Philosophy. 


All  the  noise  analyses  are  based  on  simple  statistical  models  for  roundoff  errors  (and  A-D 
conversion  errors).  Of  course,  somewhat  different  models  are  applied  for  the  different  types 
of  arithmetic.  These  models  cannot  in  general  be  verified  theoretically,  and  thus  one  must 
resort  to  experimental  noise  measurements  to  support  the  predictions  obtained  via  the  models. 
A  good  deal  of  experimental  data  on  noise  measurements  is  presented  here,  and  the  empirical 
results  are  generally  in  good  agreement  with  the  predictions  based  on  the  statistical  models. 

The  ideas  developed  in  the  study  of  simple  recursive  filters  and  the  FFT  are  applied  to  analyze 
quantization  effects  in  two  more  complicated  types  of  digital  filters  —  frequency  sampling  and 
FFT  filters.  The  frequency  sampling  filter  is  realized  by  means  of  a  comb  filter  and  a  bank 
of  second-order  recursive  filters;  while  an  FFT  filter  implements  a  convolution  via  an  FFT, 
a  multiplication  in  the  frequency  domain,  and  an  inverse  FFT.  Any  finite  duration  impulse 
response  filter  can  be  realized  by  either  of  these  methods.  The  effects  of  coefficient  quanti¬ 
zation,  roundoff  noise,  and  the  overflow  constraint  are  investigated  for  these  two  filter  types. 
Through  use  of  a  specific  example,  realizations  of  the  same  filter  design,  by  means  of 
the  frequency  sampling  and  FFT  methods,  are  compared  on  the  basis  of  differing  quantiza¬ 
tion  effects. 


Accepted  for  the  Air  Force 

Franklin  C.  Hudson 

Chief,  Lincoln  Laboratory  Office 
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QUANTIZATION  EFFECTS  IN  DIGITAL  FILTERS 


I.  INTRODUCTION 

In  recent  years,  there  has  been  a  great  upsurge  of  interest  in  digital  filtering,  or  the  process 
of  spectrum  shaping  or  spectrum  analysis  via  digital  operations  on  the  samples  of  a  signal.  Some 
important  factors  motivating  this  eoneern  have  been:  (a)  the  widespread  use  of  computer  simu¬ 
lation  in  the  design  of  analog  systems;  (b)  application  of  the  general-purpose  computer  to  com¬ 
plex  signal  processing  tasks,  such  as  processing  of  seismic  data  or  of  pictures;  and  (c)  increas¬ 
ing  speed  and  decreasing  cost  and  size  of  digital  circuitry,  making  real-time  digital  filters  fea¬ 
sible  and  practical.  In  addition,  the  discovery  of  the  fast  Fourier  transform  and  its  application 
to  high  speed  convolution  has  added  a  new  dimension  to  the  field  of  digital  filtering,  allowing 
efficient  realization  of  filter  designs  previously  considered  impractical  due  to  the  excessive  com¬ 
putation  time  needed  for  their  implementation. 

The  task  of  obtaining  a  digital  filter  for  a  specified  purpose  can  be  divided  roughly  into  two 
phases  —  design  and  implementation.  In  design,  as  in  the  approximation  problem  for  the  analog 
case,  one  seeks  a  rational  transfer  function  of  finite  order  which  approximates  to  a  satisfactory 
tolerance  some  desired  frequency  response  characteristic  (or  one  may  seek  to  approximate  some 
desired  time  response).  In  implementation,  or  synthesis,  one  seeks  a  numerical  algorithm  or 
hardware  configuration  which  will  efficiently  implement  this  transfer  function.  Clearly,  the  de¬ 
sign  and  implementation  phases  arc  not  quite  independent,  as  the  designer,  given  a  choice,  would 
seek  those  designs  which  could  be  most  efficiently  implemented. 

This  report  will  focus  on  an  issue  of  central  concern  in  digital  filter  implementation,  namely, 
the  problem  of  quantization  in  digital  filters.  Whether  a  digital  filter  is  implemented  on  a 
general-purpose  computer  or  with  special-purpose  hardware,  finite  register  length  arithmetic 
must  be  used.  One  must  take  careful  account  of  the  quantization  inherent  in  this  arithmetic,  both 
in  deciding  what  register  length  is  needed  for  a  given  implementation  and  in  choosing  between 
several  possible  implementations  of  the  same  filter  design,  which  will  be  affected  differently  by 
quantization. 

To  illustrate  a  few  of  the  options  available  for  implementing  a  given  filter  design,  consider 
a  digital  transfer  function 

M 

2  a.z-1 

H(z>  — ^ - :  (1-D 

1  +  S  b.z'1 
i=  1  1 

where  the  designer  has  specified  the  a.,  b.,  N  (the  number  of  poles),  and  M  (the  number  of  zeros. 
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excluding  the  trivial  zeros  at  z  =  0  and  z  =  «°).  This  filter  can  be  realized  via  a  single  N^-order 
recursive  difference  equation 

N  M 

y(n)  =  2  biy(nT  -  IT)  +  Tj  a.x(nT  -  iT)  (1-2) 

i=  1  i=  0 

where  the  current  output  y(nT)  is  obtained  from  a  linear  combination  of  past  outputs,  and  current 
and  past  inputs  (T  is  the  sampling  interval).  This  realization  is  usually  called  the  direct  form. 

As  an  alternative,  one  can  factor  the  rational  fraction  in  (1-1 )  in  terms  of  its  zeros  q.  and  poles  p^ 
as 

M  -1 
II  (1  -  q^z  ) 

H(z)  =  K  ^ -  (1-3) 

n  d-p.z"1) 
i=l  1 

and  synthesize  the  filter  as  a  cascade  of  simple  (first-  or  second-order)  recursions,  each  cor¬ 
responding  to  a  factor  or  group  of  factors  in  (1-3).  Another  possible  form  of  realization  could 
be  obtained  by  expanding  Il(z)  as  a  partial  fraction  sum,  and  synthesizing  the  filter  as  a  parallel 
combination  of  simple  filters  representing  the  terms  in  the  sum. 

There  are  many  other  types  of  algorithms  which  can  be  used  to  implement  a  given  Il(z).  In 
addition,  there  are  various  types  of  arithmetic  (e.g.,  fixed  or  floating  point)  which  can  be  used. 

All  realizations  would  yield  the  same  filter,  except  for  the  fact  that  the  errors  due  to  quantiza¬ 
tion  are  different  for  the  various  realizations. 

A.  OUTLINE  OF  PROBLEM  AREA 

Broadly,  the  aim  of  this  report  is  to  study  the  effects  of  quantization  in  the  various  types  of 
digital  filter  implementations,  and  thus  to  provide  insight  into  how  one  efficiently  implements 
digital  filters  in  view  of  quantization  problems.  To  set  a  framework  for  the  study,  we  need  to 
define  more  explicitly  this  rather  broad  problem  area.  To  this  end,  we  now  outline  (1)  the  basic 
computational  algorithms  which  are  used  in  digital  filtering,  (2)  the  various  types  of  arithmetic 
which  are  used  in  the  computation,  and  (3)  the  various  quantization  effects  which  occur  when 
these  algorithms  are  implemented  on  a  finite  word  length  machine. 

A  basic  numerical  algorithm  for  digital  filtering  is  the  first-  or  second-order  linear,  con¬ 
stant  coefficient,  recursive  difference  equation.  Usually,  one  would  be  wise  to  construct  a  com¬ 
plicated  recursive  digital  filter  (having  poles,  as  well  as  zeros,  in  its  transfer  function)  as  a 
combination  of  first-  and  second-order  equations,  since  such  a  realization  has  been  found  to  be 
less  sensitive  to  quantization  effects  than  a  realization  using,  say,  a  single  higher-order  differ¬ 
ence  equation.  The  second  basic  numerical  algorithm  for  digital  filtering  is  the  fast  Fourier 
transform  (FFT),  which  permits  rapid  computation  of  the  discrete  Fourier  transform  (DFT)  of 
a  signal.  This  algorithm  can  be  used,  for  example,  to  perform  digital  spectrum  analysis,  or 
to  implement  any  nonrecursive  digital  filter  via  the  route  of  fast  convolution.  A  very  large  por¬ 
tion  of  the  computation  performed  in  digital  filtering,  and,  in  fact,  in  digital  signal  processing, 
is  composed  of  these  two  basic  algorithms  —  the  simple  recursive  difference  equation  and  the 
FFT.  Thus,  a  sizable  portion  of  this  report  will  be  devoted  to  studying  the  effects  of  quantiza¬ 
tion  on  realization  of  these  algorithms. 
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Two  basie  types  of  arithmetic  —  fixed  and  floating  point  —  ean  be  used  in  the  implementation 
of  these  digital  filtering  algorithms.  The  errors  and  constraints  due  to  finite  word  length  are 
different  for  these  two  types  of  arithmetic,  and  both  will  be  studied  here.  In  addition,  a  hybrid 
seheme  known  as  bloek  floating  point,  where  a  single  exponent  is  used  for  an  entire  bloek  of  num¬ 
bers,  has  been  found  useful  for  both  the  FFT  and  recursive  difference  equations.  The  quantiza¬ 
tion  effeets  for  bloek  floating  point  arithmetic  also  will  be  studied. 

The  effeets  of  quantization  on  digital  filter  performance  ean  be  divided  into  four  key  catego¬ 
ries.  The  first  three  ean  be  classified  as  quantization  errors,  and  the  fourth  as  a  constraint 
imposed  by  the  finite  word  length.  All  the  quantization  effeets  to  be  listed  oeeur  both  in  differ¬ 
ence  equations  and  in  FFT  implementation,  and  affeet  filter  performance  (in  different  ways)  for 
any  of  the  three  types  of  arithmetic  mentioned  above.  The  first  elass  of  quantization  errors  we 
will  mention  is  the  error  eaused  by  discretization  of  the  fixed  system  eoeffieients  —  the  constant 
multipliers  in  our  difference  equations  or  the  sine-eosine  table  for  the  FFT.  Usually,  a  filter 
is  designed  on  the  basis  of  perfectly  aeeurate  eoeffieients;  the  quantization  of  the  eoeffieients  to 
finite  word  length  means  that,  in  praetiee,  we  will  eonstruet  a  slightly  different  filter,  just  as 
when  a  precisely  designed  analog  filter  is  built  from  impreeise  components.  If  our  digital  filter 
is  used  to  proeess  an  analog  signal,  a  seeond  type  of  quantization  error  is  incurred  in  the  analog- 
digital  (A-D)  conversion  of  the  signal.  The  third  souree  of  quantization  errors  is  the  roundoff 
of  the  results  of  multiplications  and  additions  used  in  the  filter  implementation.  A-D  conversion 
and  roundoff  errors  are  quite  similar  in  that  both  ean,  for  many  praetieal  eases,  be  analyzed 
as  additive  noise  superposed  on  an  otherwise  linear  system.  The  fourth  important  quantization 
effeet  is  the  limit  imposed  by  the  finite  word  length  on  the  range  of  numbers  whieh  ean  be  rep¬ 
resented  in  the  filter.  This  limit  forces  us  to  arrange  our  filter  implementation  to  prevent  over¬ 
flow.  This  constraint  is  mueh  more  severe  in  fixed  point  than  in  floating  or  bloek  floating  point, 
where  the  presence  of  an  exponent  allows  for  a  mueh  larger  dynamie  range.  The  limit  on  signal 
level  due  to  the  overflow  constraint  must  be  considered,  together  with  the  noise  of  roundoff  and 
A-D  conversion,  to  determine  the  noise-to-signal  ratio  we  would  expeet  at  the  output  of  a  digital 
filter. 

This  completes  our  brief  summary  of  the  basie  problem  area  of  quantization  in  digital  filters. 
A  large  portion  of  this  report  is  devoted  to  a  study  of  the  various  quantization  effeets  in  simple 
recursive  filters  and  FFT's.  In  the  remainder  of  the  report,  the  ideas  developed  are  applied  to 
analyze  quantization  effeets  in  two  more  eomplieated  types  of  digital  filters  —  frequency  sampling 
filters  and  FFT  filters.  From  an  input-output  viewpoint,  these  are  both  nonreeursive  type  fil¬ 
ters,  whieh  is  equivalent  to  saying  that  they  have  only  zeros  in  their  transfer  function  or  have 
impulse  responses  whieh  are  nonzero  only  for  a  finite  period  of  time.  But  the  aetual  realization 
of  the  two  filter  types  is  quite  different,  as  the  frequency  sampling  filter  utilizes  a  eomb-filter 
and  a  band  of  seeond-order  recursive  filters,  while  an  FFT  filter  implements  a  convolution  via 
an  FFT,  a  multiplication  in  the  frequency  domain,  and  an  inverse  FFT.  One  of  the  reasons  for 
selecting  these  particular  filter  types  for  study  is  that  any  finite  duration  impulse  response  fil¬ 
ter  ean  be  realized  either  via  the  frequency  sampling  method  or  via  the  FFT.  These  two  alter¬ 
nate  realizations  of  the  same  filter  ean  be  compared  on  the  basis  of  differing  quantization  effeets. 
In  addition,  excellent  filter  designs  ean  be  implemented  via  either  technique,  and  quantization 
problems  in  both  are  fairly  severe  and  quite  interesting. 

The  problem  area,  as  defined  above,  ean  be  summarized  in  the  following  outline  of  key 
topies 
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Basic  Computational  Algorithms 


First-  and  second-order  recursive  difference  equations 
Fast  Fourier  transform 

Types  of  Arithmetic 

Fixed  point 
Block  floating  point 
Floating  point 

Quantization  Effects 

Coefficient  quantization 
Analog-to-digital  converter  noise 
Roundoff  errors  in  computation 
Overflow  constraint 

Filter  Types  to  be  Studied 

Frequency  sampling  filters 

FFT  filters  (high-speed  convolution) 

B.  REVIEW  OF  PREVIOUS  WORK 

In  this  section,  we  report  on  previous  work  relevant  to  a  study  of  quantization  in  digital  fil¬ 
ters.  In  addition  to  reporting  on  specific  studies  of  the  quantization  problem,  some  background 
material  (digital  filter  design,  description  of  the  FFT  algorithm,  quantization  theory)  is  refer¬ 
enced.  An  attempt  is  made  to  indicate  where  the  work  to  be  described  in  this  report  conforms 
with  previous  studies. 

1.  Background  on  Digital  Filters  and  FFT 

1 

Kaiser  has  reviewed  in  some  detail  the  history  of  digital  filters,  and  presents  a  large  and 

thorough  bibliography.  He  discusses  in  detail  various  filter  design  techniques  and  quantization 

2 

effects,  with  particular  attention  to  coefficient  quantization.  A  book  by  Gold  and  Rader  provides 
a  thorough  introduction  to  the  whole  field  of  digital  signal  processing.  Their  chapter  on  quanti¬ 
zation  effects  is  an  excellent  overview  of  these  problems.  Also  included  in  the  book  is  a  devel¬ 
opment  of  the  basic  theory  needed  to  analyze  digital  linear  systems,  techniques  for  digital  filter 
design  including  a  description  of  frequency  sampling  filters,  a  chapter  explaining  the  FFT  algo¬ 
rithm,  and  a  chapter  (written  by  Stockham)  describing  how  the  FFT  may  be  used  to  implement 

convolution.  Two  basic  and  thorough  papers  on  digital  filter  design  techniques  have  been  written 
3  4 

by  Golden  and  Kaiser  and  Rader  and  Gold.  The  original  disclosure  of  the  FFT  algorithm  was 

5  6 

by  Cooley  and  Tukey,  and  a  later  paper  by  Cochran,  et  al.,  describes  and  elucidates  many  of 
the  various  forms  of  the  algorithm.  Stockham' s  paper^  is  the  earliest  presentation  of  the  tech¬ 
niques  of  digital  filtering  via  the  FFT. 

2.  Coefficient  Quantization 

Previous  work  on  the  effects  of  coefficient  quantization  has  concentrated  on  the  coefficient 
sensitivity  problem  for  filters  realized  via  recursive  difference  equations.  Most  of  the  work 
done  is  applicable  to  either  fixed  or  floating  point  filters.  The  relationship  of  changes  in  filter 
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parameters  (e.  g. ,  pole  positions)  to  changes  in  the  coefficients  is  the  same  for  either  fixed  or 
floating  point,  although  the  actual  errors  due  to  quantizing  the  coefficients  may  be  different. 

g 

Kaiser  studied  the  sensitivity  to  coefficient  quantization  of  the  pole  positions  of  a  digital 
filter  realized  via  a  single  N^-order  difference  equation.  He  concluded  that,  at  least  for  low- 
pass  or  bandpass  filters  with  tightly  clustered  poles,  the  sensitivity  of  the  pole  positions  in¬ 
creased  markedly  with  the  order  N  of  the  equation.  By  an  approximate  analysis  he  indicates, 
for  example,  that  the  number  of  bits  which  must  be  retained  in  the  filter  coefficients  to  prevent 
instability  can  grow  linearly  udth  N.  On  the  basis  of  these  results,  Kaiser  suggests  that  to  avoid 
severe  changes  from  a  desired  frequency  response  function,  or  even  instability,  one  should  use 
a  cascade  or  parallel  arrangement  of  low-order  recursions  (rather  than  a  single  high-order  re¬ 
cursion)  to  synthesize  any  reasonably  complex  filter  with  sharp  transitions  between  pass  and 
stop  bands. 

9 

Rader  and  Gold  considered  in  detail  the  coefficient  sensitivity  problem  for  a  second-order 
filter,  and  presented  a  realization  via  a  pair  of  coupled  first-order  equations  which  was  less 
sensitive  to  coefficient  quantization  than  a  single  second-order  equation.  This  coupled  form  can 

be  used  as  a  building  block  for  a  cascade  or  parallel  realization  of  a  more  complex  system. 

1 0 

Knowles  and  Olcayto  present  a  very  interesting  approach  to  the  coefficient  quantization 
problem,  modeling  the  actual  finite  precision  filter  as  a  parallel  combination  of  the  original  in¬ 
finite  precision  transfer  function,  and  a  stray  transfer  function  to  account  for  coefficient  quan¬ 
tization.  They  apply  their  model  to  analyze  the  effects  of  coefficient  quantization  for  direct, 
cascade,  and  parallel  form  filter  realizations.  They  carry  out  a  series  of  experiments  on  a 
particular  filter  design  (realized  via  direct,  cascade,  and  parallel  forms)  to  check  the  predicted 
deviation  between  the  frequency  responses  of  the  finite  and  infinite  precision  filters.  The  agree¬ 
ment  of  experiment  v'ith  theory  is  excellent.  For  the  filter  example  they  consider,  the  direct 
form  is  by  far  the  most  sensitive  to  coefficient  quantization,  and  thus  requires  the  most  bits  for 

a  satisfactory  realization.  Thus,  for  this  example,  the  work  of  Kaiser  is  corroborated. 

i  1 

Mantey  studies  the  coefficient  quantization  problem  from  the  viewpoint  of  trying  to  select 
a  state  variable  representation  for  the  filter  which  is  least  sensitive  to  coefficient  quantization. 
The  state  variable  representations  he  compares  are  essentially  the  direct,  cascade,  and  parallel 
forms,  and  his  results  also  indicate  that  a  cascade  or  parallel  realization  is  less  sensitive  to 
coefficient  quantization  than  a  direct  realization. 

The  key  conclusion  of  all  these  workers  seems  to  be  that  in  order  to  minimize  coefficient 
quantization  problems,  implementation  of  high-order  difference  equations  should  be  avoided. 
Complicated  filters  should  be  implemented  via  combinations  of  simple  difference  equations. 

An  interesting  application  of  some  of  the  ideas  of  these  previous  workers  is  the  study  of  co¬ 
efficient  quantization  in  frequency  sampling  filters.  In  addition  to  the  present  description  of 

1  2 

work  on  this  problem,  1  have  also  reported  upon  it  previously. 

With  respect  to  the  problem  of  coefficient  quantization  in  the  FFT,  or  in  FFT  filters, 
little  previous  work  has  been  done.  Some  work  on  this  problem  will  be  reported  in  Secs.  Ill-A 
and  V-A. 


3.  Quantization  Noise  Models 

Previous  analyses  of  the  effects  of  noise  (due  either  to  A-D  conversion  or  roundoff)  in  digital 
filters  have  generally  been  based  on  a  simple  statistical  model.  That  is,  it  has  been  assumed 
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that  errors  of  roundoff  or  A-D  conversion  have  a  uniform  first-order  probability  density  function, 
are  uncorrelated  from  sample  to  sample,  and  are  uncorrelated  with  the  signal,  and  that  all  the 
noise  sources  in  the  digital  network  are  independent  of  one  another.  These  assumptions,  though 
false  for  many  pathological  cases,  have  been  supported  by  previous  workers  for  a  large  class 
of  signals. 

1 3 

Bennett,  by  a  somewhat  involved  analysis,  derived  the  correlation  function  of  quantization 
noise  for  Gaussian  signals.  He  showed  that  for  quantization  steps  small  compared  with  signal 
variance,  there  is  virtually  no  correlation  between  successive  noise  samples  except  when  the 
correlation  between  successive  signal  samples  is  very  high.  He  presents  experimental  results 
in  which  the  spectrum  of  quantization  noise  was  measured  for  quantization  of  sine  waves  and  of 
random  signals.  In  both  cases,  the  assumption  of  white  quantization  noise  was  justified  for  suf¬ 
ficiently  small  quantization  steps. 

14 

Widrow  presented  a  formalism  by  which  the  statistics  of  the  noise  due  to  quantization  of 

random  signals  could  be  analyzed,  given  the  statistics  of  the  signals.  The  key  to  the  approach 

was  in  viewing  quantization  as  a  sampling  process  on  the  probability  density  function  of  the  input. 

He  carried  the  analysis  through  for  Gaussian  signals,  and  corroborated  the  results  of  Bennett. 

15 

Knowles  and  Edwards  worked  out  theoretically  the  autocorrelation  function  of  the  noise 
introduced  by  quantizing  a  sine  wave  with  random  phase.  The  key  step  in  their  analysis  is  an 
expansion  of  the  second-order  probability  density  function  of  the  input  in  a  series  of  orthogonal 
functions.  The  results,  which  they  verify  experimentally,  are  that  even  for  very  coarse  quan¬ 
tization  (4  or  5  bits)  the  error  spectra  are  essentially  white.  The  signals  flowing  through  a  dig¬ 
ital  filter  will  quite  often  be  of  a  periodic  rather  than,  say,  a  Gaussian  nature,  and  this  result 
somewhat  justifies  the  assumption  of  white  quantization  noise  in  such  a  case. 

Sornmoonpin1^  presents  a  great  deal  of  experimental  evidence  corroborating  the  simple 
white  noise  model  for  roundoff  errors. 

These  previous  studies  relate  to  statistical  models  for  fixed  point  roundoff  errors,  and  not 

to  floating  point  roundoff  errors.  The  floating  point  case  is  more  complicated,  since  the  expected 

magnitude  of  roundoff  errors  depends  on  signal  amplitude,  and  thus  is  not  signal  independent  as 

in  the  fixed  point  case.  Basic  error  models  for  floating  point  computation  are  presented  in 
1 7 

Wilkinson.  As  will  be  seen,  a  statistical  version  of  these  models  can  be  used  successfully  to 
predict  roundoff  noise  in  digital  filters. 

4.  Statistical  Theory  of  Noise  Measurements 

Although  there  is  some  solid  basis  for  the  simple  noise  models  mentioned  above,  they  are 
not  valid,  in  general,  and  can  be  contradicted  by  counter  examples.  Ultimately,  one  must  resort 
to  experiment  to  verify  the  predictions  of  the  models.  The  measurements  are  statistical  in  na¬ 
ture,  and  their  understanding  is  based  on  a  statistical  theory  of  noise  measurements. 

A  strong  background  in  this  theory,  including  the  theory  of  spectral  estimation  for  the  dis- 

1 B 

Crete  case,  is  presented  in  Grenander  and  Rosenblatt.  Some  modern  and  practical  procedures, 

by  which  one  might  use  a  digital  computer  for  estimating  noise  correlation  functions  and  spectra, 

1  9 

are  presented  in  Cooley,  Lewis,  and  Welch.  These  methods  center  on  use  of  the  FFT  algorithm 
for  noise  analysis. 
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5.  Noise  in  Digital  Filters 

Starting  with  the  basic  model  discussed  above,  several  workers  have  analyzed  the  effects  of 
A-I)  conversion  noise  and  roundoff  noise  on  digital  filter  performance.  Generally,  noise  meas¬ 
urements  have  been  used  to  verify  the  theory.  Noise  effects  have  been  studied  both  for  recursive 
difference  equation  filters  and,  to  a  lesser  extent,  for  the  FFT.  Although  most  previous  work 
has  concentrated  on  the  fixed  point  case,  some  work  on  floating  point  and  block  floating  point  has 
been  done. 

The  first  three  papers  to  be  discussed  deal  with  quantization  noise  in  fixed  point  recursive 
filters. 

20 

Knowles  and  Edwards  used  the  white  model  for  quantization,  and  linear  system  noise  the¬ 
ory,  to  examine  noise  effects  in  direct,  cascade,  and  parallel  recursive  filter  implementations. 
They  worked  out  bounds  on  the  expected  mean-square  output  noise  in  each  type  of  realization, 
and  carried  out  noise  measurements  which,  to  a  reasonable  extent,  verified  the  theory.  Care 
was  taken  to  obtain  statistically  convergent  noise  measurements.  No  general  statement  was  ven¬ 
tured  about  the  realization  forms,  since  only  a  few  filter  examples  were  considered  in  detail. 

21 

Gold  and  Rader  used  the  same  model  to  derive  convenient  formulas  for  evaluating  total 
(accounting  for  all  noise  sources)  mean-squared  output  noise  in  digital  filters.  They  analyze 
carefully  the  noise  effects  in  first-  and  second-order  filters,  and  present  experimental  results. 
Analysis  of  the  computation  time  necessary  for  a  statistically  consistent  measurement  of  output 
noise  power  is  carried  out.  It  is  shown  that,  for  very  lightly  damped  filters,  this  measurement 

time  becomes  prohibitive. 

22 

Gold  and  Rabiner  carry  out  detailed  analyses,  both  theoretical  and  experimental,  of  noise 
effects  in  a  digital  format  synthesizer,  realized  via  a  cascade  of  digital  resonators.  An  attempt 
is  made  to  select  that  arrangement  which  produces  acceptable  vowel  outputs,  for  the  least  reg¬ 
ister  length.  Overflow  constraints  and  signal  dynamic  range  requirements  are  considered,  along 
with  the  noise  problems,  in  selecting  among  various  possible  filter  arrangements. 

To  complete  the  analysis  of  noise  effects  in  a  fixed  point  filter,  it  is  necessary  that  one  con¬ 
sider,  along  with  the  noise,  the  limit  on  signal  level  due  to  the  overflow  constraint.  Later,  in 
Sec.  Il-C,  a  formal  approach  to  this  problem  will  be  presented  and  output  noise-to-signal  ratios 
for  digital  filters  will  be  derived. 

An  analysis  of  roundoff  errors  in  floating  point  recursive  filters  has  been  presented  by 
23 

Sandberg.  As  mentioned  above,  floating  point  roundoff  errors  differ  from  fixed  point  in  that 

the  expected  magnitude  of  the  error  due  to  a  roundoff  is  proportioned  to  signal  amplitude.  The 

result  of  this  is  that,  to  perform  a  statistical  analysis  of  floating  point  roundoff  noise,  one  must 

assume  a  statistical  model  for  the  signal,  as  well  as  for  the  roundoff  errors.  Sandberg  chooses 

instead  to  use  a  nonstatistical  approach,  and  derives  deterministic  bounds  on  output  noise-to- 

signal  ratio.  The  difficulty  with  these  bounds  is  that  they  apply  to  the  situation  where  all  errors 

add  up  in  the  worst  possible  way,  and  thus  are  quite  pessimistic  compared  with  results  which 

24 

actually  occur  in  practice.  Kaneko  and  Liu  analyze  floating  point  roundoff  noise  via  the  sta- 

25 

tistical  approach.  The  author,  in  collaboration  with  Oppenheim,  has  applied  the  method  of 
Kaneko  and  Liu  to  analyze  some  simple  recursive  filters,  and  has  experimentally  verified  the 
model.  This  work,  which  also  includes  a  comparison  of  the  noise-to-signal  ratios  of  fixed  and 
floating  point  filters,  will  be  discussed  in  Sec.  II. 
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Oppenheim  has  studied  the  effects  of  roundoff  noise  in  simple  recursive  filters  realized 
with  block  floating  point  arithmetic.  These  results,  and  a  discussion  of  the  comparison  block 
floating  point  with  fixed  and  floating  point,  will  be  reviewed  in  Sec.  Il-E. 

Although  most  of  the  previous  work  on  quantization  noise  in  digital  filters  has  treated  re¬ 
cursive  difference  equation  realizations,  some  study  has  been  devoted  to  the  problem  of  round- 

27 

off  errors  in  the  FFT.  Welch  proposes  and  analyzes  a  model  for  roundoff  noise  in  fixed  point 

FFT.  He  also  considers  block  floating  point  FFT,  and  obtains  an  upper  bound  on  the  output  noise- 

to-signal  ratio  by  considering  the  case  where  the  array  must  be  rescaled  at  each  stage.  Some 

experimental  results,  which  are  in  reasonable  agreement  with  the  theory,  arc  presented.  In 

Sec.  III-C,  some  additional  work  on  fixed  point  FFT,  and  a  noise  analysis  of  block  floating  point 

FFT  which  takes  signal  statistics  into  account,  will  be  presented. 

28 

Gentleman  and  Sande  studied  the  problem  of  roundoff  errors  in  a  floating  point  FFT.  They 

chose  to  derive  a  deterministic  bound  on  the  output  noise-to-signal  ratio,  rather  than  using  a 

29 

statistical  approach.  Their  bound  is  rather  pessimistic,  and  the  author  has  used  a  statistical 
approach  to  predict  output  noise-to-signal  ratio  in  an  FFT.  This  analysis,  and  experiments  con¬ 
firming  the  theory,  will  be  presented  in  Sec.  Ill—  D.  Gentleman  and  Sande  also  compared  the  ac¬ 
curacy  of  a  floating  point  FFT  to  a  floating  point  DFT  via  a  direct  summing  of  products,  and  con¬ 
clude  that  the  FFT  is  significantly  more  accurate.  A  method  for  arranging  the  summing  of  prod¬ 
ucts,  so  that  the  "slow"  Fourier  transform  is  essentially  as  accurate  as  the  FFT,  will  be  dis¬ 
cussed  later  in  Sec.  Ill— D. 

For  the  present  report,  the  techniques  of  statistical  noise  analysis,  along  with  consideration 
of  overflow  constraints,  will  be  used  to  derive  expected  noise-to-signal  ratios  for  frequency  sam¬ 
pling  and  FFT  filters.  Some  work  which  will  be  reported  relating  to  consideration  of  the  over¬ 
flow  constraint  in  FFT  filters  has  already  been  published  by  Oppenheim  and  the  author. 

In  general,  the  approach  we  shall  use  for  the  analysis  of  roundoff  errors  will  be  a  statistical 

approach.  There  are  effects  of  roundoff,  important  in  some  situations,  which  cannot  be  analyzed 

31 

statistically.  Such  effects  as  the  dead  band  effect  and  limit  cycles  will,  in  general,  not  be  dis¬ 
cussed  here. 

C.  ROUNDOFF  ERROR  MODELS 

The  fundamental  source  of  errors  incurred  during  the  actual  digital  filtering  computation 
is  the  error  caused  when  the  result  of  a  multiplication  or  an  addition  must  be  rounded  to  a  word 
length  smaller  than  that  needed  to  represent  the  exact  result.  To  analyze  this  problem,  one 
usually  assumes  a  simple  statistical  model  for  the  roundoff  errors,  and  employs  linear  system 
noise  theory  to  analyze  their  effect  on  the  filter  output.  Different  roundoff  error  models  must 
be  used  for  fixed  and  floating  point  arithmetic.  The  noise  models  cannot,  in  general,  be  verified 
theoretically,  so  that  experiments  must  be  used  to  test  their  validity. 

We  should  mention  that  there  is  an  alternative  to  using  statistical  roundoff  error  models; 
that  is,  one  can  derive  deterministic  bounds  on  the  noise  at  the  output  of  a  filter.  The  chief  dif¬ 
ficulty  with  such  bounds  is  that  they  are  usually  quite  pessimistic,  in  comparison  with  the  results 
of  experiments.  Our  emphasis  will  be  on  the  statistical  approach. 

In  this  section,  statistical  roundoff  error  models  for  both  fixed  and  floating  point  computa¬ 
tion  will  be  presented  and  discussed.  Some  experiments  will  be  described,  where  pertinent  sta¬ 
tistics  of  roundoff  noise  sequences  have  been  measured,  with  a  view  toward  testing  the  models. 

In  general,  the  results  are  in  accord  with  those  of  previous  workers  (see  Sec.  I-B-3). 
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Before  proceeding  to  a  discussion  of  the  models,  let  us  emphasize  that  roundoff  in  digital 
filtering  is  unavoidable,  and  cannot  be  prevented  by  increasing  the  register  length  by  some  num¬ 
ber  of  bits.  To  illustrate  this,  consider  the  simple  first-order  recursion 

y(nT)  =  ay(nT  -  T)  +  x(nT)  .  (1-4) 

Consider  fixed  point  computation,  and  assume  that  a  and  y(nT  —  T)  each  contain  t  bits.  Then 
their  product  will  contain  2t  bits.  If  the  length,  in  bits,  of  this  product  is  not  reduced  by  round¬ 
ing,  it  will  grow  by  t  bits  for  each  successive  iteration. 

1.  Fixed  Point  Roundoff  Errors 

In  fixed  point  arithmetic,  rounding  errors  need  occur  only  when  multiplications  are  per¬ 
formed.  Fixed  point  additions  are  error-free  provided  no  overflow  occurs. 

To  discuss  quantitatively  the  errors  due  to  roundoff  of  products,  we  need  to  define  a  conven¬ 
tion  for  the  representation  of  numbers  in  fixed  point.  We  shall  assume  a  word  length  t  +  1  bits, 
and  shall  consider  our  numbers  to  be  fixed  point  fractions,  consisting  of  t  bits  plus  a  sign,  with 
the  binary  point  just  to  the  right  of  the  sign  bit.  Thus,  for  a  positive  number,  a  one  in  the  least- 
significant  bit  represents  a  numerical  value  of  2  *.  This  convention  will  be  used  throughout  this 
report.  It  implies  no  loss  of  generality,  but  merely  assigns  a  scale  of  units  to  our  fixed  point 
numbers.  The  unit  2  1  will  be  referred  to  as  the  width  of  quantization. 

When  two  of  these  fixed  point  fractions,  for  example,  the  constant  a  and  the  variable 
y(nT  —  T)  in  (1-4),  are  multiplied  together,  the  product  will  generally  contain  2t  bits  plus  a  sign. 
But  only  t  bits  plus  sign  can  be  retained  for  use  in  future  computation.  The  range  of  values 
which  the  resulting  error  can  assume  depends  on  exactly  how  the  product  is  cut  to  single  preci¬ 
sion.  If  the  product  is  rounded  to  the  nearest  single  precision  fraction,  then  the  error  will  lie 
in  the  interval  [—  ( 1  / 2 )  2  *,  (1/2)  2  *" ).  But  suppose  that  we  truncate  the  double  length  product 
by  simply  dropping  the  lower  order  bits.  Then,  when  the  product  is  positive,  the  rounding  er¬ 
ror  will  lie  in  (—  2~\  0);  and,  when  the  product  is  negative,  the  error  will  lie  in  (0,  2  *).  Thus, 
for  truncation,  the  error  is  signal  dependent  in  that  its  sign  depends  on  the  sign  of  the  resulting 
product. 

For  the  most  part,  our  analyses  of  roundoff  noise  will  be  based  on  rounding,  rather  than 
truncation,  basically  because  it  is  analytically  convenient  to  be  able  to  make  the  assumption  that 
the  rounding  noise  is  uncorrelated  with  the  signal,  but  also  because  rounding  generally  causes 
less  output  noise  than  truncation.  However,  truncation  can  usually  be  implemented  more  simply 
and  in  less  time,  so  that  we  shall  often  discuss  this  case  and  report  on  experiments  measuring 
truncation  noise. 

Focusing  our  attention  on  rounding  rather  than  truncation,  let  us  mention  two  further  points. 
First,  the  rounding  process  is  not  completely  defined  by  saying  that  we  round  to  the  nearest 
quantization  level,  for  we  must  define  what  to  do  when  our  double  precision  product  lies  exactly 
between  two  levels.  This  question  arises,  for  example,  in  the  case  where  we  are  multiplying 
our  signal  by  the  constant  l/2,  for  here  our  result  will  lie  exactly  between  two  quantization  lev¬ 
els  approximately  half  of  the  time.  To  be  consistent  with  our  assumption  that  rounding  errors 
are  signal  independent,  we  shall  assume  that  a  random  choice  is  made  as  to  which  of  the  equi¬ 
distant  quantization  levels  we  shall  choose.  As  we  shall  see  later,  there  are  situations  where 
the  mean-squared  output  noise  in  a  digital  filter  (specifically,  one  involving  FFT  computation) 
will  differ  significantly,  depending  on  whether,  in  this  midway  situation,  we  round  up  or  down 
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at  random  or  choose,  for  example,  to  always  round  up.  The  second  point  is  that  we  need  not 
assume  that  the  constant  coefficients  in  the  system,  and  the  signals  being  processed,  are  re¬ 
tained  to  the  same  precision.  For  example,  in  a  practical  hardware  implementation  of  a  filter, 
we  might  want  to  use  12-bit  coefficients  and  16-bit  data,  rounding  our  28-bit  products  to  16  bits. 

Now  let  us  define  the  statistical  model  we  shall  use  for  fixed  point  rounding  errors.  With 
our  width  of  quantization  taken  as  2  *,  we  assume  the  rounding  errors  to  have  a  probability  den¬ 
sity  function  which  is  uniform  in  the  interval  [ —  ( l/ 2 )  2  *,  (l/2)  2  *]-  Thus,  the  variance  of  the 
rounding  errors  can  easily  be  derived  to  be  ( 1  / 12)  2  .  The  next  key  assumption  we  make  is 

that  the  rounding  error  sequences  are  uncorrelated  from  sample  to  sample,  that  is,  they  repre¬ 
sent  digital  white  noise  sources.  Next,  we  assume  that  the  roundoff  noise  is  uncorrelated  witli 
the  signal.  Finally,  we  shall  generally  be  considering  computations  where  many  multipliers  are 
present  and  must  be  associated  with  a  roundoff  noise  source;  we  shall  assume  all  these  noise 
sources  to  be  mutually  uncorrelated. 

The  basic  model  for  rounding  errors  just  defined  will  also  be  assumed  to  apply  to  noise  due 
to  A-D  conversion.  In  addition,  we  shall  assume  that  the  A-D  noise  is  uncorrelated  with  the 
roundoff  noise. 

Previous  studies  of  these  noise  models,  both  theoretical  and  experimental,  have  been  dis¬ 
cussed  in  Sec.  I-B-3.  Some  experiments,  testing  directly  some  of  the  assumptions  of  the  mod¬ 
els,  will  be  reported  below  in  Sec.  I-C-3. 

2.  Floating  Point  Roundoff  Errors 

Floating  point  roundoff  errors  differ  from  fixed  point  roundoff  errors  in  two  key  ways.  First, 
the  expected  magnitude  of  a  floating  point  roundoff  error  depends  on  the  magnitude  of  the  signal, 
since  the  numerical  value  of  the  last  bit  in  the  mantissa  is  determined  by  the  exponent  of  the  sig¬ 
nal.  Second,  not  only  the  results  of  multiplications,  but  also  the  results  of  additions  must  be 
rounded  in  floating  point. 

The  following  conventions  will  be  used  with  respect  to  floating  point  computation.  We  as¬ 
sume  a  binary  representation  of  floating  point  numbers  in  the  form  of  (sign)  -  a-  2^,  where  a 
and  b  each  have  a  fixed  number  of  bits.  The  number  b,  called  the  exponent,  is  an  integer;  the 
number  a,  called  the  mantissa,  is  a  normalized  fraction  lying  in  the  range  i/2  to  1.  It  will  gen¬ 
erally  be  assumed  that  enough  bits  are  allotted  to  the  exponent  b  so  that  no  computed  number 
will  lie  outside  the  permissible  range.  The  notation  ££(•  )  will  denote  the  machine  number  result¬ 
ing  from  the  arithmetic  operations  specified  by  the  parentheses.  In  general,  the  order  in  which 
the  operations  indicated  inside  the  parentheses  are  carried  out  needs  to  be  specified. 

Suppose  two  floating  point  machine  numbers  x  and  y,  each  with  a  t-bit  mantissa,  arc  mul¬ 
tiplied.  Then,  the  exact  product  will  have  a  2t-bit  mantissa,  of  which  only  the  most-significant 
t  bits  can  be  retained.  Depending  on  whether  the  mantissa  is  truncated  or  rounded,  the  sign  of 
the  error  will  or  will  not  depend  on  the  sign  of  the  product  xy.  But,  in  either  case,  the  numer¬ 
ical  value  of  the  error  will  be  proportional  to  2^x^,  where  b(xy)  is  the  exponent  of  the  product. 
Essentially,  this  means  that  the  error  is  proportional  to  the  product,  a  fact  which  can  be  sum- 


•  ^  17 
marized  as 

f<(xy)  =  xy(l  +  e) 

(I— 5a) 

where 

|  e  |  <  2-t 

(I— 5b) 
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When  two  floating  point  machine  numbers  x  and  y  are  added,  the  mantissa  of  the  exact  sum 
will,  in  general,  have  more  than  t  bits,  even  when  the  exponents  of  x  and  y  are  equal.  The 
inexact  computation  can  be  expressed  as 


ff(x  +  y)  =  (x  +  y)  (1  +  6 ) 


(I-6a) 


where 


6  <  2 


-t 


(I-6b) 


We  now  introduce  a  statistical  model  for  the  roundoff  errors  of  floating  point  multiplication 
and  addition.  As  in  the  fixed  point  case,  the  assumptions  we  make  cannot  be  proven  theoretically, 
and,  in  fact,  can  be  demonstrated  to  be  false  for  specific  cases.  Furthermore,  our  model  for 
floating  point  computation  cannot  be  derived  from  our  model  for  fixed  point  computation.  In  fact, 
arguments  can  be  constructed  showing  the  two  models  to  be  inconsistent.  However,  there  are 
two  important  justifications  for  the  floating  point  roundoff  model  wc  shall  use.  First,  the  model 
is  analytically  convenient  for  use  in  predicting  output  noise  in  recursive  filters  and  FFT's.  Sec¬ 
ond,  the  predictions  of  the  model  have  been  found  to  be  in  reasonable  accord  with  experimental 
noise  measurements. 

Our  model,  which  applies  to  the  case  of  roundingt  rather  than  truncation,  is  as  follows. 

We  assume  the  roundoff  variables  e  and  6  in  (1-5)  and  (1-6)  to  be  independent  of  the  signal  (the 
result  of  the  multiplication  or  addition)  and  to  have  zero  mean.  Furthermore,  where  the  mul¬ 
tiplications  and  additions  are  used  in  iterative  schemes,'  we  assume  these  roundoff  variables  to 
be  independent  from  iteration  to  iteration.  We  also  assume  that  the  roundoff  variables  associ¬ 
ated  with  all  the  various  multiplications  and  additions  in  a  computation  are  independent  of  each 
other. 

It  remains  to  specify  the  first-order  probability  density  function  (pdf)  of  our  roundoff  vari¬ 
ables.  By  analogy  with  our  fixed  point  model,  and  in  accordance  with  (I-5b)  and  ( I- 6b ),  we  might 
be  led  to  assume  that  e  and  6  are  each  uniformly  distributed  in  ( —  2  2  S  with  variances 

a2  a  2  =  (1/3)  2  2t.  Experiments  have  shown  that  the  actual  pdf's  are  not  quite  uniform,  and 
^  u  —  2t 

that  the  variances  are  slightly  less  than  (1/3)  2  .  Also,  the  details  of  the  pdf  and  exact  value 

of  the  variances  have  been  found  to  vary  slightly  with  the  signal  being  processed  and  the  compu- 

2  2  -2t 

tation  being  performed.  However,  the  assumption  that  and  are  proportional  to  2  has 

proved  valid.  Since  our  analysis  of  noise  in  digital  filters  generally  involves  only  the  variances 
2  2 

cr ^  and  a ^  ,  and  not  pdf's,  our  general  approach  will  be  to  leave  these  variances  as  variables 
and  use  empirical  numbers  for  them  only  when  comparison  of  theory  with  experiment  is  to  be 
carried  out. 

2  2  -2t 

Typical  empirical  values  for  the  roundoff  variable  variances  are  a  ^  «  0. 23  X  2  .  We 

should  note  that,  for  practical  purposes,  only  rough  estimates  of  these  variances  are  needed, 
since  even  a  50-percent  error  in  the  variance  represents  less  than  one  bit  of  difference  in  the 
predicted  rms  output  noise. 

One  final  point  should  be  made  about  floating  point  roundoff  models;  that  is,  in  order  to  per¬ 
form  a  statistical  analysis  of  noise  in  a  floating  point  digital  filter,  one  must  assume  a  statisti¬ 
cal  model  for  the  signal,  as  well  as  for  the  roundoff  variables.  To  see  this,  consider  the  triv¬ 
ial  digital  filter 


f  As  in  the  fixed  point  cose,  we  ossume  that  when  o  result  lies  equolly  between  two  quantization  levels,  o  ran¬ 
dom  choice  is  mode  os  to  whether  to  round  up  or  down.  This  midwoy  situotion  occurs  frequently  in  floating  point 
addition. 
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y(nT )  ax(nT) 


(1-7) 


where  the  output  sequence  is  just  a  constant  times  the  input  sequence.  Floating  point  computa¬ 
tion  would  yield 

ft  [y(nT)]  -  ax(nT)  [1  +  e(n)]  (1-8) 

with  an  error 

e(n)  -  ft  [y(nT)]  -  y(nT)  =  c(n)x(nT)  (1-9) 

which  is  proportional  to  the  signal  x(nT).  Given  that  we  model  x(nT)  as  a  stationary  random 

2 

process  with  variance  a  ,  we  can  find  the  error  variance  as 


2 

<j 


e 


2  2 


a  a 


e  x 


(1-10) 


3.  Experimental  Verification  of  Roundoff  Noise  Models 

Some  experiments  directed  toward  verifying  the  simple  white  noise  model  for  roundoff  er¬ 
rors  have  been  carried  out.  The  direct  tests  on  the  roundoff  error  models,  to  be  described  be¬ 
low,  were  carried  out  for  the  fixed  point  case.  Experiments  testing  the  models'  predictions  of 
output  noise  in  digital  filters,  for  both  the  floating  and  fixed  point  cases,  will  be  described  later. 

Sample  sequences  of  roundoff  errors  were  obtained  by  multiplying  signals  by  fixed  coeffi¬ 
cients,  and  saving  the  errors  caused  by  rounding  the  results  to  single  precision.  The  two  types 
of  signals  used  were  sinusoidal  signals  generated  by  recursive  difference  equations  with  poles 
on  the  unit  circle,  and  white  noise  sequences  obtained  from  a  random  number  generator.  For 
these  (and  other)  noise  measurements,  special  signals  for  which  the  model  is  clearly  invalid 
needed  to  be  avoided.  An  example  of  such  a  signal  is  a  sinusoid  generated  from  a  tabic  lookup, 
where  the  sampling  frequency  is  exactly  divisible  by  the  sine  wave  frequency.  Here,  the  signal 
samples-  will  repeat  exactly  each  period.  Thus,  the  roundoff  error  samples,  obtained  by  multi¬ 
plying  the  periodic  signal  by  a  fixed  coefficient  and  rounding  the  results,  will  repeat  with  the 
same  period,  clearly  contradicting  the  assumption  of  white  roundoff  noise. 

For  the  first  set  of  experiments,  attention  was  directed  to  the  case  of  fine  quantization.  The 
two  types  of  signals  used  were  generated  as  sequences  of  18-bit  words,  were  multiplied  by  18-bit 
coefficients,  and  the  results  rounded  to  18  bits. 

First-order  statistics  (mean,  variance,  probability  density  function)  of  the  error  sequences 
were  estimated  experimentally.  In  all  cases,  as  more  data  points  were  used,  the  mean  converged 
quickly  toward  zero,  and  the  variance  approached  the  predicted  value  of  2  V12-  Experimental 

histograms  were  taken  to  check  the  assumption  of  uniform  first-order  pdf.  As  more  and  more 
data  were  used,  the  results  converged  toward  the  assumed  uniform  density  function. 

The  next  step  was  to  test  the  important  assumption  that  the  roundoff  error  sequences  are 
uncorrelated  from  sample  to  sample.  To  make  this  investigation  quantitative,  a  technique  pre¬ 
sented  in  Grenander  and  Rosenblatt  (Ref.  18,  pp.  97-101)  was  used  to  obtain  theoretically  the 
probability  density  function  for  the  first  correlation  coefficient  (the  autocorrelation  function  for 
a  lag  of  one  sample,  divided  by  the  autocorrelation  for  a  lag  of  zero),  for  the  case  of  white  se¬ 
quence.  The  approximate  result  was  that  for  a  white  sequence  of  N  samples,  the  first  correla¬ 
tion  coefficient  p^  would  have  a  normal  distribution  with  mean  of  —  (l/N),  and  variance  l/N.  By 
using  this  result,  a  significance  test  was  performed  on  measured  values  of  for  roundoff  error 
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sequences.  We  tested  3000  error  sequences  of  length  N  =  512,  obtained  from  rounding  of  sinus¬ 
oidal  signals  of  500  different  frequencies,  multiplied  by  6  different  coefficients.  At  the  1 -percent 
significance  level,  1.1  percent  of  the  measured  values  of  failed  the  significance  test,  and  at 
the  5-percent  significance  level,  5.83  percent  of  the  tests  failed.  These  results  arc  in  quite 
reasonable  accord  with  the  white  noise  model  for  roundoff  error.  In  addition  to  checking  when 
significance  levels  were  exceeded,  an  experimental  histogram  was  obtained  for  the  3000  meas¬ 
ured  correlation  coefficients.  The  distribution  followed  reasonably  well  the  predicted  normal 
distribution. 

Similar  experiments  were  then  performed  for  coarser  quantization,  that  is,  rounding  was 
done  to  fewer  and  fewer  bits,  and  correlation  coefficients  of  error  sequences  were  measured. 

For  roundoff  of  the  sinusoidal  signals  used,  even  down  to  4  bits,  the  assumption  of  uncorrelatcd 
samples  was  not  contradicted  for  the  significance  test  just  discussed.  Measurements  of  the  sec¬ 
ond,  third,  fourth  and  fifth  correlation  coefficients  also  tended  to  agree  with  the  assumption  of 
uncorrelated  samples.  These  results  are  in  accord  with  experiments  reported  by  Sornmoonpin.1^ 

Finally,  spectral  measurements  were  performed  on  roundoff  error  sequences  obtained  by 
multiplying  18-bit  sine  waves  by  18-bit  coefficients  and  rounding  the  results  to  18  bits.  The  most 
successful  measurement  technique  used  was  to  section  the  noise  sequence  into  sections  2048 
points  long,  compute  periodograms  of  the  individual  sections  via  the  FFT,  smooth  each  periodo- 
gram  via  a  moving  average  smoothing,  and  then  average  the  results  over  all  the  smoothed  period¬ 
ograms,  We  found  that  if  enough  data  were  used,  the  measured  spectra  would  begin  to  look  white, 
but  that  the  amount  of  data  needed  was  quite  large.  In  one  example,  a  smoothing  window  of  spec¬ 
tral  width  l/l6  the  sampling  frequency  was  used  on  each  periodogram,  so  that  actually  only  16 
independent  points  on  the  spectrum  were  obtained.  Still,  24  sections,  or  about  48,000  data  points, 
had  to  be  used  before  the  mean-squared  deviation  of  the  spectrum  from  its  average  was  less  than 
5  percent  of  the  average.  Such  results  are  actually  in  reasonable  accord  with  the  kind  of  con¬ 
vergence  predicted  by  spectral  estimation  theory.  Since  spectral  measurements  on  white  noise 
should  converge  faster  than  measurements  on  colored  noise,  the  implication  is  that  in  order  to 
make  accurate  spectral  measurements  on  output  noise  in  digital  filters,  one  would  have  to  find 
ways  to  obtain  very  long  noise  sequences  and  be  willing  to  expend  much  computation  time  for  their 
analys  is. 

D.  SUMMARY 

Before  proceeding  to  the  body  of  the  report,  let  us  now  briefly  outline  the  order  of  topics  to 
be  studied. 

Section  II  will  concentrate  on  quantization  effects  in  simple  (first-  and  second-order)  recur¬ 
sive  filters.  The  analysis  of  coefficient  quantization  will  be  reviewed  for  the  fixed  point  case, 
and  carried  over  to  the  floating  point  case.  The  work  of  others  analyzing  roundoff  noise  for  the 
fixed  point  case  will  be  reviewed  and  considered  jointly  with  the  overflow  constraint  to  obtain  out¬ 
put  noise-to-signal  ratios.  The  use  of  a  clamp  to  counteract  the  problem  of  overflow  in  fixed 
point  recursions  will  be  investigated.  Noise-to-signal  ratios  for  the  block  floating  and  floating 
point  cases  will  be  derived,  and  experimental  verification  presented.  A  comparison  of  the  three 
different  types  of  arithmetic,  on  the  basis  of  differing  noise-to-signal  ratios,  will  be  presented 
for  these  first-  and  second-order  recursive  filters. 

In  Sec.  Ill,  quantization  effects  will  be  studied  for  the  other  basic  algorithm  of  digital  filter¬ 
ing—  the  FFT.  The  effect  of  coefficient  quantization  will  be  studied  both  theoretically  and 


13 


experimentally  for  the  fixed  and  floating  point  eases.  With  respeet  to  roundoff  errors,  noise 
to-signal  ratio  analyses  for  fixed,  bloek  floating,  and  floating  point  will  be  derived,  and  exper¬ 
imental  results  will  be  presented  for  all  three  types  of  arithmetic.  A  comparison  of  quantization 
effeets  in  the  three  types  of  arithmetie  will  be  presented. 

Section  IV  is  devoted  to  frequency  sampling  filters.  The  coefficient  accuracy  problem  is 
investigated  in  detail  for  the  fixed  point  case,  and  experimental  results  are  presented  for  a  spe- 
eifie  filter  example.  Floating  point  eoeffieient  errors  are  discussed.  Seetion  Ill's  fixed  point 
noise  analysis  of  digital  resonators  is  applied  to  derive  noise-to-signal  ratios  for  the  frequency 
sampling  filter.  Alternate  realizations  for  the  digital  resonators  are  compared.  On  the  basis 
of  the  coefficient  and  roundoff  noise  analysis,  and  the  overflow  constraint,,  register  length  re¬ 
quirements  for  the  realization  of  a  fixed  point  frequency  sampling  filter  are  set  forth.  Compar¬ 
ison  with  the  floating  point  ease  is  diseussed. 

In  Sec.  V,  quantization  effeets  in  FFT  filters  are  analyzed,  and  a  comparison  with  alternate 
realizations  of  the  same  filter  via  frequency  sampling  is  included.  Applying  the  results  of  See.  Ill, 
we  examine  the  effect  of  coefficient  errors  in  an  FFT  filter.  Experimental  measurements  of  the 
aetual  error  in  the  filter's  frequency  response  due  to  eoeffieient  quantization  are  reported.  An 
upper  bound  on  the  output  of  a  eireular  convolution  is  derived,  whieh  assists  one  in  arranging 
sealing  of  the  array  to  prevent  overflow  in  a  fixed  point  FFT  filter.  For  the  fixed  point  case,  the 
output  noise-to-signal  ratio  is  analyzed  and  the  results  applied  specifically  to  the  same  filter  ex¬ 
ample  considered  in  See.  IV  in  the  frequency  sampling  form.  A  comparison  of  the  FFT  vs  the 
frequency  sampling  realization  is  diseussed. 

Finally,  in  See.  VI,  the  results  of  the  report  are  diseussed,  and  some  problems  for  further 
study  are  proposed. 
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II.  SIMPLE  RECURSIVE  FILTERS 


This  section  will  be  devoted  to  a  study  of  quantization  effects  in  first-  and  second-order 
recursive  digital  filters.  We  study  these  simple  recursions  in  much  detail  because  they  are  the 
basic  building  blocks  from  which  complicated  recursive  digital  filters  should  be  constructed. 

First,  we  shall  investigate  the  coefficient  quantization  problem,  then  consider  jointly  the  prob¬ 
lems  of  roundoff  noise  and  overflow  in  order  to  predict  output  noise-to-signal  ratios  for  these 
simple  filters.  Fixed,  block  floating,  and  floating  point  arithmetic  all  will  be  considered  and 
compared  on  the  basis  of  differing  quantization  effects. 

A.  COEFFICIENT  QUANTIZATION-  FIXED  POINT 

Here  we  review  the  analysis  of  the  effects  of  coefficient  quantization  in  simple  recursive  fil¬ 
ters.  The  analysis  is  quite  similar  to  that  in  Refs.  2  and  9.  Attention  will  be  mainly  directed 
toward  the  effect  of  coefficient  quantization  on  the  pole  positions  of  the  filters. 

For  a  first-order  difference  equation 

y(nT)  =  ay(nT  —  T)  +  x(nT)  (11-1) 

there  is  one  coefficient  a,  and  the  pole  position  is  equal  to  the  value  of  this  coefficient.  Thus, 
if  rounding  causes  the  filter  to  be  realized  with  a  +  Aa,  then  the  error  in  the  pole  position  is  Aa. 
For  fixed  point  coefficients  of  t  +  1  bits,  the  pole  positions  which  can  be  achieved  are  spaced 
uniformly  along  the  real  axis  in  the  z-plane  with  separation  2  *.  The  error  in  pole  position  is 
bounded  by  \Z  ,  the  maximum  error  in  a  fixed  point  coefficient. 

The  situation  is  more  complicated  for  a  second-order  equation  such  as 

y(nT )  =  a1y(nT  -  T)  +  a2y(nT  -  2T)  +  x(nT)  (II-2) 

which  will  be  assumed  to  have  complex  conjugate  poles  at  r  e*-'0,  where  the  relationship  between 
coefficients  and  pole  positions  is  specified  by 

a1  =  2r  cos  Q  and  a2  =  —  r^  (11-3) 

or  equivalently 

r  =  J—  a2  and  0  =  cos  *  (a^/Zj—  a.,)  .  ( II-4 ) 

Assuming  small  quantization  errors,  we  can  approximate  the  sensitivity  of  pole  positions  to  co¬ 
efficient  errors  by  differentiating  r  and  0  with  respect  to  a^  and  a2-  The  result  is 

Ar  a-Aa2/2r  ,  A0  «  —  A  a.,/2r  tan  0  — Aa^/2r  sin0  .  (11-5) 

The  error  in  radius,  for  the  interesting  case  of  r  near  unity,  is  not  very  different  from  the  case 
of  the  first-order  equation;  but  the  error  in  0,  for  0  near  0,  can  be  quite  large.  This  can  be 
quite  troublesome,  for  example,  in  low-pass  frequency  sampling  filters,  which  require  accurately 
tuned  resonators  with  resonant  frequencies  which  are  a  small  fraction  of  the  sampling  rate. 

To  see  this  problem  from  another  viewpoint,  consider  that  the  quantization  of  a^  and  a2  de¬ 
fines  a  grid  of  allowable  pole  positions  in  the  z-plane.t  From  (11-3)  one  can  deduce  that  the  grid 
is  defined  by  intersections  of  uniformly  spaced  straight  lines  parallel  to  the  imaginary  axis,  with 

t  J.  Evans,  private  communication. 
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o  REALIZABLE  POLE  POSITIONS 


UNIT  CIRCLE 


Fig.  1.  Grid  of  ollowoble  pole  positions  — 
direct  form. 


Fig.  2.  Grid  of  ollowoble  pole  positions  - 
coupled  form. 
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circles  centered  on  the  origin  and  becoming  more  closely  spaced  as  r  approaches  1.  Such  a 
grid  is  pictured  in  Fig.  1.  For  small  6  and  r  near  1,  we  can  see  that  the  circles  and  straight 
lines  tend  to  become  parallel,  and  quantization  of  the  angles  of  the  poles  becomes  very  coarse. 

A  second-order  transfer  function  in  which  the  pole  positions  are  less  sensitive  to  parameter 
errors  can  be  realized  via  the  pair  of  coupled  first-order  equations, 

y1(nT)  =  a1y1(nT  -  T)  -  a2y2(nT  -  T)  +  b1x(nT) 

y2(nT)  =  a2yi(nT-T)  +  aiy2(nT-T)  +  b2x(nT)  .  (II-6) 

The  relationship  between  coefficients  and  pole  positions  is  given  by 

a.  =  r  cos  0  ,  a2  =  r  sin0  .  (II-7 ) 


Expressions  for  approximate  errors  in  pole  positions  can  be  obtained  as 
Ar  «  (cos  0)  Aa1  +  (s  in  0 )  A  a.. 


.  _  _  ,cos  0,  .  ,sin0.  . 
AG  «=  (— —  )  Aa2  -  (— )ASl 


(II  — 8 ) 


These  errors  in  pole  positions  are  quite  different  than  those  given  in  (II- 5 ),  and  for  0  small, 
the  error  in  0  is  substantially  less  for  the  realization  of  (II-6)  than  for  (II- 2 ).  Of  course,  (II-6 ) 
requires  substantially  more  computation  than  does  (II- 2 ),  and  thus  should  be  resorted  to  only 
when  quantization  of  parameters  is  a  problem.  (Actually,  the  coupled  form,  for  certain  cases, 
yields  an  improved  noise- to-signal  ratio  over  the  direct  form,  and  might  be  resorted  to  for  this 
reason  also.)  We  shall  see  later  (in  Sec.  IV)  that  for  the  case  of  a  low-pass  frequency  sampling 
filter,  coupled  form  resonators  yield  distinct  improvements  over  the  performance  obtained  using 
direct  form  resonators. 

It  is  interesting  to  examine  the  grid  of  allowable  pole  positions  for  the  coupled  form.  One 
can  deduce  from  (II-7)  that  this  is  a  uniform  rectangular  grid  as  depicted  in  Fig.  2.  In  contrast 
to  Fig.  1,  quantization  of  the  angles  of  the  poles  does  not  become  especially  coarse  for  small  0, 
but  is  essentially  equally  fine  for  all  0. 


B.  COEFFICIENT  QUANTIZATION  -  FLOATING  POINT 

The  equations  of  the  preceding  section  for  the  relationships  between  coefficients  and  pole 
positions,  and  for  the  sensitivity  of  pole  position  to  small  changes  in  the  coefficients,  are  valid 
for  the  floating  point  case  as  well  as  for  the  fixed  point  case.  The  only  distinction  is  that  the 
actual  floating  point  coefficients  are  not  uniformly  quantized  over  their  full  range  of  values. 

This  fact  can  be  included  approximately  in  the  sensitivity  formulas  (II- 5 )  and  (II-8)  by  expressing 
the  errors  in  the  coefficients  as  percentage  errors,  proportional  to  the  coefficients  themselves. 
Additional  insight  can  be  gained  by  considering  the  structure  of  the  grid  of  allowable  pole  posi¬ 
tions,  for  the  floating  point  case. 

For  a  floating  point  first-order  filter  as  in  (II—  1 ),  the  interesting  range  of  parameters  is 
generally  restricted  to  the  range  |  <  a  <  1.  For  this  range,  the  allowable  pole  positions  are  uni¬ 
formly  spaced  along  the  real  axis,  as  in  the  fixed  point  case.  The  spacing  for  the  floating  point 
case  is  2  \  where  t  now  represents  the  number  of  bits  in  the  mantissa. 

In  the  next  few  paragraphs,  our  comparisons  between  floating  and  fixed  point  will  be  based 
on  the  implied  assumption  that  the  number  of  bits  in  the  floating  point  mantissa  equals  the  number 
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of  bits  (not  including  sign)  in  the  fixed  point  word.  It  should  be  pointed  out  that  this  implies  a 
longer  total  register  length  for  floating  point,  due  to  the  extra  bits  needed  for  the  exponent. 
However,  our  intention  is  to  indicate  qualitatively  the  differences  in  how  a  roundoff  error  affects 
pole  position  in  the  two  types  of  arithmetic,  and  a  quantitative  comparison  on  the  basis  of  equal 
register  lengths  is  not  attempted. 

Let  us  consider  the  direct  form  second-order  filter  (11-2).  We  assume  roundoff  errors  in 
the  coefficients  of  the  form 

iarVi  •  Aa2=f2a2  (II'9) 

where 

1  I  <  2-t  ,  |  <" 2 1  <  2_t  .  (11-10) 

Substituting  (11-9)  and  (11-3)  in  (11-5),  we  obtain  sensitivity  equations  for  the  floating  point  case 
as 

Ar  «  c  (r/2)  ,  AG  a  f  (r/2  tanO)  -  ^/tanO  .  (11-11) 

For  the  interesting  case  of  r  near  unity,  the  sensitivity  of  the  pole  radius  is  essentially  the  same 
as  for  fixed  point.  If,  in  addition,  0  is  small,  the  error  AO  is  proportional  to  l/O  as  in  the  fixed 
point  case.  Thus,  for  high-gain  filters  where  the  ratio  of  resonant  frequency  to  sampling  rate 
is  fairly  low,  the  coefficient  quantization  problems  for  floating  and  fixed  point  are  essentially  the 
same.  For  filters  where  0  approaches  jr/2,  AO  becomes  very  much  smaller  than  in  the  fixed 
point  case  since  the  coefficient  =  2r  cos  O  becomes  small,  causing  errors  in  this  coefficient 
to  become  proportionately  small. 

As  in  the  fixed  point  case,  the  grid  of  allowable  pole  positions  for  the  direct  form  is  defined 
by  intersections  of  straight  lines  parallel  to  the  imaginary  axis,  with  circles  centered  on  the  or¬ 
igin.  When  r  is  near  unity,  the  relative  spacing  of  the  circles  is  essentially  the  same  as  in  the 
fixed  point  case  depicted  in  Fig.  1.  For  r  cos  O  =  Re  z  in  the  range  l/2  to  1,  corresponding  to 
the  case  of  reasonably  small  0,  the  relative  spacing  of  the  straight  lines  is  also  essentially  the 
same  as  in  the  fixed  point  case.  For  r  cos  0  much  less  than  l/2  (that  is,  where  0  is  near  ir/Z), 
the  straight  lines  are  spaced  much  more  closely  than  in  the  fixed  point  case. 

Now  let  us  consider  the  coupled  form  resonator  of  (11-6).  Substituting  (11-9)  and  (11-7)  in 
( II— 8 ),  we  obtain  the  sensitivity  formulas 

Ar  ~  f,r  cos^G  +  c,r  sin^G  (ll-12a) 

1  Z 

AG  «  (f2-  fi)  (2H^®)  .  (II- 4  2b) 

The  error  in  r  is  slightly,  but  not  significantly,  different  from  the  fixed  point  case.  The  sen¬ 
sitivity  in  G  is  never  greater  than  in  the  fixed  point  case,  and  for  Q  near  0  or  tt/ 2,  the  factor 
sin2G  in  (ll-12b)  implies  that  the  error  in  G  will  be  much  smaller  than  in  the  fixed  point  case. 

The  grid  of  allowable  pole  positions,  as  in  the  fixed  point  case,  is  rectangular,  but  the  lines 
are  not  uniformly  spaced.  For  the  area  of  the  z-plane  (considering  only  the  first  quadrant),  where 

-=■  <  Re  z  =  a,  =  r  cos  G  <  1 

2  1 

■|<Imz  =  a2=r  sinQ  <  1  (11-13) 
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the  lines  are  spaced  uniformly  with  separation  2  *,  as  in  fixed  point.  But  for  small  values  of 
He  z  or  Im  z,  corresponding  to  small  values  of  the  coefficients  a^  or  a^,  and  to  resonant  angles 
near  0  or  jt/2,  the  grid  becomes  much  finer.  As  indicated  in  (II- 9),  when  the  coefficient  a1  or 
a^  becomes  small,  the  error  due  to  quantizing  this  coefficient  becomes  proportionately  small. 

C.  ROUNDOFF  NOISE  AND  OVERFLOW  CONSTRAINT  -  FIXED  POINT 

In  this  section,  we  analyze  the  effects  of  roundoff  noise  in  first-  and  second-order  filters 
implemented  using  fixed  point  arithmetic.  The  derivations  of  output  noise  variance  arc  a  review 
of  the  work  of  others.  Experimental  results  reported  by  these  previous  workers  have  gen¬ 

erally  supported  the  theoretical  results  for  output  noise  variance.  We  shall  discuss  how  the 
overflow  constraint,  considered  jointly  with  these  noise  formulas,  determines  output  noise-to- 
signal  ratio,  and  shall  derive  formulas  for  output  noise-to-s ignal  ratio,  for  the  case  of  a  flat 
signal  spectrum.  An  alternative  to  complete  prevention  of  overflow,  the  insertion  of  a  saturat¬ 
ing  device  or  clamp  in  the  feedback  loop,  will  be  discussed.  Some  experimental  observations 
of  the  distortions  introduced  by  such  a  device  are  reported. 

Although  we  shall  not  explicitly  analyze  the  effects  of  A-D  noise,  it  should  be  apparent  that 
this  noise  could  be  easily  included,  using  the  same  approach  as  for  roundoff  noise.  However, 
our  focus  of  attention  is  on  the  roundoff  noise,  since  this  is  the  noise  introduced  explicitly  by  the 
digital  computation.  With  this  viewpoint,  we  can  compare  the  accuracies  of  various  algorithms 
which  might  be  used  to  process  signals  which  have  been  already  digitized. 

1.  General  Formulas  for  Noise  Variance 

Our  general  approach  to  the  analysis  of  roundoff  noise  in  fixed  point  digital  filters  is  as  fol¬ 
lows.  We  model  all  the  noisy  multipliers  as  noiseless  multipliers  combined  with  additive  white 
noise  sources.  Since,  according  to  the  model  of  See.  I-C-l,  all  the  noise  sources  arc  independ¬ 
ent  of  each  other  and  of  the  signal,  we  can  determine  mean, 
variance,  or  spectrum  of  the  output  noise  simply  by  adding 
the  effects  of  all  the  individual  noise  sources.  To  deter¬ 
mine  the  effect  of  any  individual  noise  source,  wc  find 
(from  inspection  of  the  circuit  diagram  of  the  filter,  or 
otherwise)  a  transfer  function  between  the  noise  source  and 
the  filter  output.  We  then  use  linear  system  noise  theory 
to  derive  the  statistics  of  the  output  noise  produced  by  this 
particular  noise  source.  Thus,  the  basic  situation  we  must  analyze  is  as  depicted  in  Fig.  3. 
Given  white  noise  e  (nT)  as  input  to  an  arbitrary  transfer  function  H(z),  what  are  the  statistics 
(mean,  variance,  etc.  )  of  the  output  e(nT)? 

We  can  conveniently  examine  this  model  via  the  convolution  sum.  Thus, 

n 

e(nT)  =  £  h(mT)  e(nT  —  mT)  (11-14) 

m=  0 

where  h(mT),  the  inverse  z  transform  of  H(z),  is  the  impulse  response  of  the  filter.  The  input 
noise  e(nT)  is  presumed  to  be  zero  for  n  <  0,  and  the  system  is  initially  at  rest.  The  roundoff 
error  sequence  e(nT)  is  assumed  to  have  zero  mean,  variance 

<r*  =  2"2t/l2  (11-15) 


c(nT)  O  »  H(l)  - «►  e(nT) 


Fig.  3.  White  noise  exciting  a  linear 
digital  filter. 
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and  be  independent  froru  sample  to  sample.  Thus,  e(nT)  is  a  weighted  sum  of  independent  ran¬ 
dom  variables.  Its  mean  is  zero,  and  its  variance  is 

n 

<j2  =  a 2  Y  h2(mT)  .  (11-16) 

e  o  u  '  ' 

m=0 

The  power  spectrum  of  e(nT)  is  simply  the  squared  magnitude  of  the  transfer  function  H(o-ju  1  ). 

For  a  stable  system  with  all  its  poles  inside  the  unit  circle,  the  right  side  of  (11-16)  con¬ 
verges,  and  the  steady-state  variance  of  e(nT)  can  be  obtained  by  letting  n  — We  shall  usu¬ 
ally  work  with  this  steady-state  result. 


a  2  =  o'  2  Y  h2(mT) 

e  o  u 
m-0 


(II-17) 


Often,  instead  of  directly  evaluating  the  sum  in  (11-17),  it  is  simpler  to  use  the  identity 


V  h2(mT)  =  ~  ^H(z)  H(l/z)  z-1dz 
m=0 


(11-18) 


and  employ  the  method  of  residues  to  perform  the  contour  integration. 


2.  Noise -to-Signal  Ratio  -  First-Order  Case 

Let  us  apply  the  technique  just  discussed  to  obtain  the  output  noise  variance  for  a  first-order 
system,  as  represented  in  Fig.  4,  or  by  the  equation 

y(n)  =  ay(n  -  1)  +  f(n)  +  x(n)  .  (11-19) 

In  (11-19),  and  in  the  remainder  of  the  report,  we  assume  for  convenience  that  our  units  of  time 
have  been  chosen  so  that  the  sampling  interval  T  =  1.  This  implies  no  loss  of  generality. 


Fig.  4. 


Noisy  first-order  filter  (fixed  point). 


The  noise  source  t  (n)  enters  the  system  at  the  same  point  as  does  the  input  x(n).  The  sys¬ 
tem  function  H(z)  is  given  by  1  /( 1  —  az  ),  and  h(m)  =  am.  Using  (11-17),  the  steady-state  out¬ 
put  noise  variance  can  be  easily  computed  as 


2 

<7 

e 


2  v  2m 

cr  ,  a 
o  u 

m-  0 


(1  -  a2) 


(Il-20n) 
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For  the  case  of  a  high-gain,  narrowband  filter,  with  a  =  1  —  6,  and  6  small,  (II-20a)  tends 
toward 


2 

a 

e 


(II -2  Ob) 


The  implications  of  these  noise  formulas  are  tricky.  The  mean-squared  value  of  the  noise 
clearly  increases  as  a  approaches  unity.  But  this  increase  in  filter  gain  also  means  that  the 
output  signal  power  will  increase.  For  a  signal  with  flat  spectrum,  the  signal  input  goes  through 
the  same  power  gain  as  does  the  noise  source,  so  that  the  output  signal- to-noise  ratio  would  seem 
not  to  change  as  the  pole  of  the  filter  moves  toward  the  unit  circle.  For  a  low-frequency  signal, 
it  would  seem  that  the  output  signal-to-noise  ratio  would  increase  as  the  pole  moves  toward  z  =  1. 
However,  these  results  are  misleading,  since  they  do  not  take  into  account  the  fact  that  with  a 
finite  word  length,  the  input  signal  must  be  kept  small  enough  so  that  it  does  not  cause  overflow 
in  the  computation.  Thus,  the  obtainable  signal-to-noise  ratio  decreases  as  a  approaches  unity. 

Although  the  formula  (II-20a)  for  output  noise  variance  does  not  depend  on  the  signal  being 
processed,  it  is  clear  that  the  output  noise-to-signal  ratio  in  the  filter  does  depend  on  the  type 
of  input  signal.  However,  we  would  like  to  obtain  quantitative  results  for  noise-to-signal  ratio, 
and  in  order  to  do  so  must  assume  a  specific  model  for  the  signal.  We  choose  to  analyze  the 
situation  where  the  signal  is  white,  that  is  a  sequence  of  zero  mean,  uncorrelated  random  vari¬ 
ables.  This  is  a  sort  of  intermediate  situation,  between  the  extreme  cases  of  a  signal  whose 
spectrum  is  completely  in  the  filter  passband,  and  one  where  spectrum  is  out  of  band.  It  is  a 
case  which  lends  itself  fairly  readily  to  analysis,  and  where  the  results  can  be  compared  easily 
with  results  to  be  derived  later  for  floating  and  block  floating  point  filter  realizations.  The  cau¬ 
tion  must  be  stated  that  for  cases  where  the  signal  spectrum  is  not  broad,  our  results  may  be 
misleading.  To  illustrate  the  dependence  of  noise-to-signal  ratio  on  signal  spectrum,  we  shall 
derive  also,  for  a  first-order  filter,  the  noise-to-signal  ratio  for  sinusoidal  input.  It  should  be 
clear  that  the  techniques  we  use  can  be  straightforwardly  applied  to  obtain  noise-to-signal  ratios 
for  any  given  signal  spectrum. 

Now  let  us  proceed  to  mathematically  apply  the  overflow  constraint  to  our  first-order  filter, 
and,  on  the  basis  of  white  signal  statistics,  derive  a  formula  for  the  mean-squared  output  noise- 
to-signal  ratio.  According  to  our  convention  that  fixed  point  words  are  considered  as  signed  frac¬ 
tions,  the  requirement  for  no  overflow  is  that  no  output  of  the  filter  exceed  unity  in  value.  Thus, 
we  require  that 

|  y(n)  |  <1  ,  all  n  .  (11-21) 

In  order  to  translate  this  constraint  into  a  bound  on  the  input  sequence,  we  observe  that  the  max¬ 
imum  possible  output  value  can  be  expressed  in  terms  of  the  maximum  allowed  input  value  as 


max  (|y(n)|  )  =  max  (|x(n)|  )  ^  |h(n)|  .  (11-22) 

n-0 

From  (11-21)  and  (11-22),  we  deduce  that  to  guarantee  no  overflows,  x(n)  must  be  restricted  to 
the  range 
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-  1 


2  h(n) 
n  0 


x(n) 


2  h(n)  I 
n  0 


(11-23) 


We  shall  assume  x(n)  white  and  uniformly  distributed  between  the  limits  in  (11-23).  Thus,  the  in¬ 
put  signal  variance  for  our  first-order  filter  is 


a  ^ 


(1  -  a) 
2  3 


The  variance  of  the  output  signal  is 
2 


1 _  ^2  _  (1  -  a) 


y  "  (l-a2)  x  3(1  +a) 

and  dividing  ( 1 1  —  2  Oa )  by  (11-25)  we  obtain  the  output  noise-to-signal  ratio  as 

2  ,  2 
cr  3cr 

e  o _ 


2  *2 
a  (1  —  a) 

y 


(11-24) 


(11-25) 


(II  —  26a) 


Writing  a  =  1  —  6  and  substituting  (11-15)  in  (11-25),  we  obtain 


( 11- 26b) 


Thus,  the  mean-squared  output  noise-to-signal  ratio  is  proportional  to  the  reciprocal  of  the 
square  of  the  distance  of  the  pole  from  the  unit  eirele. 

We  should  re-emphasize  that  for  the  case  we  are  considering  (white  input  signal),  the  sig¬ 
nal  and  noise  go  through  the  same  gain,  independent  of  where  the  pole  is.  It  is  the  overflow  con¬ 
straint  restricting  the  input  signal  level  which  causes  noise-to-signal  ratio  to  change  with  filter 
gain. 


2  2 

For  contrast,  we  shall  now  obtain  a  la  for  the  case  of 

e  y 

A  cos(u>Qn  +  <p),  where  <p  is  uniformly  distributed  in  (—ir,ir). 


a  sinusoidal  signal  x(n)  = 
To  satisfy  (11-23),  we  set 


so  that 


Now 


A  =  1  —  a 


2  _  (1  -  a) 
x  2 


a  =  H 

y 


(iw  , 

•  °) 


x  ,  ,  2  , 

1  +  a  —  2a  cos  ui 


2 

<7  =  - y 

y  .  -2 


(1  -  a) 


2(1  +  a  —  2a  cosoi^) 
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and  dividing  this  result  into  (II-20a)  wo  obtain  the  ratio 


2 

a 

_ e_ 

2 

a 

y 


,  ,-2t,  (1  +  a2  —  2a  cos  u  ) 
(2  )  ' _ o_ 

6(1  -  a)2  (1  -  a2) 


(II-26c) 


if  we  let  cj  -*■  0,  so  that  the  signal  goes  through  the  maximum  filter  gain,  this  approaches 


<ry2  6(1  -a)2  (1  -a2) 


which,  for  a  1  —  6  and  5  small,  tends  toward 


( II—  26d ) 


Thus,  if  we  excite  our  high-gain  first-order  filter  with  a  low-frequency  sinusoid,  cr^/a2  is  pro¬ 
portional  to  l/6  rather  than  to  l/62  as  in  (II— 26b) .  For  a  sinusoidal  input  of  low  enough  fre¬ 
quency,  the  signal  goes  through  more  gain  than  does  the  roundoff  noise.  This  is  in  contrast  to 
the  case  of  white  input,  where  the  signal  and  noise  go  through  the  same  gain. 


3.  Noise-to-Signal  Ratios  -  Second-Order  Case 

We  now  proceed  to  derive  formulas  for  output  noise  variance  for  three  alternate  realizations 
of  the  same  two-pole,  one-zero  transfer  function.  Output  noise-to-signal  ratios  will  be  obtained 
for  the  case  of  a  white  input  signal. 

The  transfer  function  of  concern  is 


H(z) 


1  —  r  cos  0  z' 


,  „  -1  ,  2-2 
—  2r  cos  0  z  +  r  z 


(11-27) 


having  a  complex  pole  pair  at  z  =  re*''0,  and  a  zero  on  the  real  axis  at  z  =  r  cos  0.  The  system 
behaves  as  a  digital  resonator  with  resonant  frequency  determined  by  the  angle  0  and  the  sam¬ 
pling  rate. 

The  first  realization  to  be  considered  is  the  direct  form,  consisting  of  the  single  second- 
order  difference  equation 


y(n)  =  2r  eos  0  y(n  —  1 )  —  r2y(n  —  2)  +  x(n)  —  r  cos  0  x(n  —  1 )  (11-28) 

2  21 

Inspection  of  a  circuit  diagram  ’  for  (11-28)  would  show  that  the  noise  sources  due  to  the  three 
multiplications  pass  through  only  the  poles  of  the  filter.  The  resulting  output  noise  variance  is 

a2  =  3G,  cr  2  (II- 29a) 

e  l  o 

where 


G 


1 


- — ,) 

1  - r  '  Vr  +  1  -  4r  cos  0  +  2r  ' 


(II- 29b) 
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For  the  high-gain  ease,  where  r  -  1  —  6  and  6  is  small,  this  becomes 


2 

a 


e 


3cr  ‘ 


46  sin  6 


(11-30) 


We  have  assumed  that  rounding  of  the  produets  is  done  before  performing  the  needed  sums  to 
earry  out  the  recursion.  If  we  were  to  accumulate  the  sums  in,  say,  a  double  precision  accumu¬ 
lator,  before  rounding,  the  output  noise  would  be  somewhat  redueed. 

The  "canonic  form"  realization  for  our  system  is  represented  by  the  equations 

u(n)  =  x(n)  +  2r  cosO  u(n  —  1)  —  r 2u(n  —  2) 


y(n)-u(n)— rcos0u(n— 1)  (11-31) 

The  essential  difference  between  the  eanonie  and  direet  forms  is  that  in  the  canonic  form  the  sig¬ 
nal  passes  first  through  the  poles  of  the  filter  and  then  through  the  zero,  while  the  reverse  is 
true  for  the  direct  form.  In  the  canonic  form,  the  two  important  noise  sourees  in  the  system 
[those  introduced  in  computing  u(n)]  are  filtered  by  the  zero,  as  well  as  by  the  poles.  The  out¬ 
put  noise  variance  is 


where 


2 

a 

e 


2  G  <7  2 
2  o 
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1  —  r 

or,  for  r  close  to  unity, 

,  2 
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2  2  2 
r  sin  O  (1  +  r  ) 

+  1  —  4r2  eos20  +  2r2 


(II- 32a) 


(11-3  2b ) 


(11-33) 


Comparing  (11-30)  and  (11-33),  we  see  that  the  direct  form  leads  to  increased  noise  for  low 
resonant  frequencies  whereas  the  eanonie  form  does  not.  We  might  be  led  to  consider  the  ean¬ 
onie  form  superior  on  this  basis,  but  one  must  consider  signal  levels  as  well  as  noise  levels  in 
comparing  alternate  forms.  Since,  in  the  canonic  form,  the  signal  passes  first  through  the  poles 
and  then  the  zero,  the  input  signal  will  generally  have  to  be  smaller  than  in  the  direet  form,  in 
order  to  prevent  overflow.  A  noise-to-signal  ratio  analysis,  as  given  below,  yields  essentially 
the  same  results  for  the  two  forms. 

To  prevent  overflow  for  the  direet  form  (11-28),  we  require  that  |y(n)|  <1,  or  since  the 
overall  impulse  response  is 


that 


h(n)  =  r°  cos  (n0) 


(11-34) 


|x(n)|  <  — - - -  (11-35) 

£  r11  |  cos  (n0 )  | 

n=0 

Assuming  x(n)  white  and  uniformly  distributed  according  to  the  constraint  (11-35),  the  output 
noise-to-signal  ratio  for  the  direet  form  is 


24 


9G1% 


OO 
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1 

| cos (nO )  | 
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We  can  approximately  bracket  the  expression 

00  I 

Y  r11  |  cos  (n0)  |  / 

n  0 


by  noting  that  an  upper  bound  is 


(11-36) 


V  n  1 
u  r  1  _  r 
n  0 

A  lower  bound  is  obtained  by  observing  that  the  sum  of  the  absolute  values  of  an  impulse  re¬ 
sponse  is  the  maximum  attainable  output  value  from  a  filter  if  the  maximum  input  value  is  unity. 
The  maximum  output  of  our  second-order  system  at  resonance  (which  can  be  found  by  evaluating 
H(z)  in  (11-27)  at  z  e-*G )  thus  provides  a  lower  bound  on  the  sum  of  the  absolute  values  of  the 
impulse  response.  For  the  high-gain  case,  this  resonant  gain  is  approximately  1/26.  Thus, 
we  shall  consider 


<  Yj  r11  |  cos  (nO)|  <  y 
0 


(11-37) 


2 

Using  these  approximations  in  (11-36),  and  noting  that  (G^/G^)  ~  1/sin  0  for  the  high-gain  case, 
we  have  for  the  direct  form  that 


9  <t  ' 


462  sin2  0 


e 

2 


9<t 


r2  -2^ 
6  sin  0 


(direct  form) 


(11-38) 


To  prevent  overflow  in  the  canonic  form  (11-31),  we  require  !u(n)|  <  l,t  or  since  the  impulse 
response  of  the  filter  (poles  only)  between  x(n)  and  u(n)  is 


h(n) 


n  sin  [(n  t  1 )  0] 
sin  0 


(11-39) 


we  require 


I  x(n)  |  < 


1 


sin  O 


n  0 


|  sin  (n  +  1 )  0  | 


(11-40) 


This  is  a  lower  bound  on  the  input  signal  than  in  (11-35).  Using  approximations  similar  to  the 
above,  the  output  noise-to-signal  ratio  for  the  canonic  form  may  be  bracketed  as 


6a  ' 


6a  ' 


2  2  2  2  2 
46  sin  0  a  6  sin  0 

y 


(canonic  form) 


(11-41 ) 


t  Once  u(n)  is  computed,  it  must  be  ottenuated  by  at  least  1/(1  +  |  r  cos  9| )  so  thot  y(n)  does  not  overflow.  This 
ottenuation  hos  only  minor  effects  on  the  output  noise-to-signal  rotio,  ond  will  be  neglected. 
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This  predicts  essentially  the  same  behavior  of  noise-to-signal  ratio,  as  a  function  of  bandwidth 
and  resonant  frequency,  as  the  direct  form  result  (11-38). 

Now  wc  consider  a  third  realization  of  our  filter,  via  the  coupled  first-order  equations 

y(n)  =  rcos0y(n—  1)  —  r  sin0  u(n  —  1 )  +  x(n) 

u(n)  r  sinO  y(n  —  1)  +  r  cos  0  u(n  —  1)  .  (11-42) 

Assuming  (as  wc  have  throughout  this  discussion)  that  we  round  after  each  multiplication,  the 
output  noise  variance  for  r  near  unity  is 


CT 


2 
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2 

a 

o 

6 


(11-43) 


the  same  (except  for  a  factor  of  2)  as  the  canonic  form.  Hut  if  we  consider  also  the  overflow 
constraint,  we  shall  find  that  at  low  resonant  frequencies  the  coupled  form  noise-to-signal  ratio 
is  superior  to  the  canonic  and  direct  forms.  This  is  because  the  noise  is  less  than  in  the  direct 
form,  and  the  transfer  function  U(z)/X(z)  relating  the  intermediate  variable  u(n)  to  the  input  has 
essentially  the  same  gain  as  that  relating  y(n)  to  the  input,  unlike  for  the  canonic  form  where 
U(z)/X(z)  becomes  large  for  low  resonant  frequencies.  Using  approximations  similar  to  those 
used  in  obtaining  (11-38)  and  (11-41),  wc  can  bracket  the  noise-to-signal  ratio  as 


—  -j  4  <  — —■  (coupled  form)  .  (11-44) 

6  ^  <t  6 

y 


Unlike  in  the  direct  and  canonic  forms, 
filter  is  decreased. 


does  not  increase  as  the  resonant  frequency  of  the 


4.  Clamping 

As  we  have  seen,  the  requirement  of  no  overflows  severely  limits  the  attainable  signal-to- 
noise  ratio  at  the  output  of  a  fixed  point  recursive  filter.  An  alternative  to  completely  prevent¬ 
ing  overflow  is  to  insert  a  clamp,  or  saturating  element,  in  the  feedback  loop  of  the  filter. 

To  illustrate  the  use  of  a  clamp,  consider  the  digital  resonator 

y(n)  =  a1y(n  -  1)  +  a2y(n  -  2)  +  x(n) 

When  the  magnitude  of  the  sum  on  the  right-hand  side  of  this  equation  becomes  too  large  for  the 
fixed  point  computer  word  (greater  than  1),  the  output  is  made  to  assume  the  largest  fixed  value 
which  the  word  length  allows,  rather  than  permitting  an  overflow  and  abrupt  change  of  sign  of 
the  output.  This  clamped  result  is  then  used  in  computing  the  next  output. 

Clearly,  insertion  of  a  clamp  would  permit  larger  input  signal  levels  to  be  used,  and  thus 
increase  the  ratio  of  output  signal  to  noise.  But  this  apparent  improvement  must  be  traded  off 
against  the  distortion  introduced  by  the  clamp.  For  example,  in  situations  where  overflows 
would  occur  only  rarely,  a  clamp  might  be  a  useful  device,  either  to  improve  the  output  signal- 
to-noise  ratio  or  to  permit  a  system  to  be  realized  with  shorter  word  length. 

The  theoretical  analysis  of  a  clamped  filter  is  quite  complicated,  since  the  system  is  non¬ 
linear,  with  the  nonlinearity  in  a  feedback  loop.  Therefore,  a  series  of  experiments  were  per¬ 
formed  to  investigate  the  effects  of  clamping  in  both  first-  and  second-order  filters. 
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In  one  series  of  experiments,  the  inputs  used  were  sine  waves  (small  enough  to  cause  no 
overflow)  with  large,  overflow-producing  pulses  added.  With  the  clamp,  the  effect  of  the  pulses 
died  out  in  the  transient  response  time  of  the  filter,  and  steady-state  filter  behavior  was  re¬ 
stored.  If  the  clamp  was  removed  and  an  overflow  allowed  to  occur,  it  was  found  that  a  sequence 
of  many  additional  overflows  were  caused,  and  many  time  constants  of  the  system  would  pass 
before  steady-state  behavior  was  restored. 

In  another  set  of  experiments,  the  inputs  used  were  sine  waves  large  enough  to  cause  over¬ 
flow.  When  overflow  was  allowed  to  occur,  the  resulting  outputs  in  no  way  resembled  the  out¬ 
puts  that  would  have  occurred  with  no  overflow.  When  a  clamp  was  used,  the  resulting  outputs, 
for  a  variety  of  filter  parameters  and  input  frequencies,  resembled  clipped  sine  waves  and  re¬ 
tained  the  frequency  of  the  input.  Such  distortion  might  not  be  intolerable  in  a  chain  of  resona¬ 
tors,  for  example,  where  the  harmonic  distortion  might  be  filtered  out  by  other  resonators  in 
the  chain. 

These  experiments  indicated  that  a  clamp  would  be  a  useful  device,  e.g.,  in  a  hardware  dig¬ 
ital  filter  with  severely  limited  word  length.  One  would  expect  that  a  study  of  the  effects  of 
clamping  in  some  practical  system,  such  as  a  digital  vocoder,  might  well  yield  some  savings  in 
register  length  requirements. 

D.  ROUNDOFF  NOISE  -  FLOATING  POINT 

In  this  section,  we  analyze  the  effects  of  roundoff  noise  in  first-  and  second-order,  floating 

point  digital  filters.  Formulas  for  output  noise- to- signal  ratio  are  obtained,  which  later  will  be 

compared  with  the  corresponding  results  for  fixed  and  block  floating  point  arithmetic.  Our  sta- 
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tistical  method  of  analysis  follows  that  of  Kaneko  and  Liu.  Some  of  the  results,  both  theoreti¬ 
cs 

cal  and  experimental,  have  been  previously  reported  by  Weinstein  and  Oppenhcim. 

We  shall  employ  the  model  for  floating  point  roundoff  errors  which  has  been  presented  in 
Sec.  I-C-2,  assuming  rounding  rather  than  truncation.  With  respect  to  the  overflow  question, 
we  shall  assume  that  enough  bits  are  allotted  to  the  exponent  so  that  no  overflows  occur.  Later, 
in  comparing  floating  and  fixed  point,  we  shall  consider  how  many  bits  are  actually  needed  in 
the  exponent  to  satisfy  this  assumption. 

In  our  analyses,  several  of  the  formulas  given  in  the  preceding  section  on  fixed  point  arith¬ 
metic  will  be  found  helpful  also  for  the  floating  point  case. 

1.  First-Order  Case 

For  a  first-order  filter  of  the  form 

y(n)  =  ay(n  —  t)  +  x(n)  (11-45) 

the  actual  output  computed  with  floating  point  arithmetic  can  be  expressed  as 

w(n)  =  {  aw(n  —  1 )  [1  +  e  (n) ]  +  x(n)}  [1  4  £  (n) ]  .  (11-46) 

The  random  variables  e(n)  and  £  (n)  account  for  the  roundoff  errors  due  to  the  floating  point  mul¬ 
tiply  and  add,  respectively.  Following  Kaneko  and  Liu,  we  define  the  output  noise  sequence 
e(n)  =  w(n)  —  y(n),  subtract  (11-45)  from  (11-46),  neglect  second-order  terms  in  e,  e,  and  i; ,  and 
obtain  a  difference  equation  for  the  error  e(n),  as 

e(n)  -  ae(n  -  1)  =  ay(n  -  1 )  [c(n)  +  ^  (n)]  +  x(n)  £  (n)  =  u(n)  .  (11-47) 
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Assuming  (see  See.  I-C-2)  that  the  e(n)  and  |(n)  are  zero  mean  and  independent  from  sample-to- 

sample  (white),  and  that  e(n),  £(n),  and  the  signal  x(n)  are  mutually  independent,  u(n)  in  (11-47) 

is  white  noise  with  variance  dictated  by  the  statistics  of  the  signal  x(n),  and  the  variances  of  the 

roundoff  variables.  We  assume  that  the  variances  of  all  roundoff  variables  are  equal,  with  value 
2  2  -2t 

denoted  by  a  .  An  order-of-magnitude  estimate  of  a  ~  2  /3  ean  be  obtained  by  assuming  that 

_te  2 

e(n)  is  uniformly  distributed  between  its  limits,  ±2  .  Empirieal  values  of  a  are  slightly  less 

than  this,  and  the  value 

CTf2  =  (0.23)2"2t  (11-48) 

will  be  used  in  comparing  theory  and  experiment. 

Now,  if  we  assume  that  x(n)  is  stationary,  then  u(n)  is  stationary  white  noise,  and  by  using 

the  method  of  See.  II-C-1,  the  variance  a  2  of  the  output  noise  e(n)  ean  be  obtained  easily  from 

2  e 
the  variance  <j  of  u(n)  as 
u 


2  2  V  I.2,  ,12  ...  ,n, 

<7  cr  h  (n)  - =r  <7  (11-49) 

e  u  >->  '  .  2  u 

rs  1  —  a 

n  0 

where  h(n)  =  an  is  the  filter  impulse  response. 

At  this  point,  in  order  to  proeeed  any  further,  we  must  assume  speeifie  statistics  for  the 
input  signal  x(n).  Unlike  the  fixed  point  ease,  we  eannot  obtain  output  noise  varianee  without 
assuming  signal  statistics. 

Let  us  assume,  for  example,  that  x(n)  is  stationary  white  noise  of  varianee  <j\  Then,  we 
obtain 


1  +  a 


(1  -  a  ) 


2  2 
— T  <7  (7 

2,2  c  x 


(11-50) 


Noting  that  <r 


2 

y 


<7X/(1  -  a  ), 


1  +  a 

1  -  a2 


we  ean  express  (11-50)  as  a  noise-to-signal  ratio 


(7 


2 

c 


whieh  beeomes  for  the  high-gain  ease,  where  a  -  1—6, 


(II  —  51a) 


2 


(II  —  51b) 


As  might  be  expeeted,  the  noise-to-signal  ratio  increases  as  the  pole  moves  toward  the  unit  eir- 
ele.  The  rate  of  increase,  however,  is  not  so  great  as  in  the  ease  [see  ( II  —  26b ) ]  where  a  fixed 
point  filter  is  exeited  by  a  white  signal. 

As  another  example,  x(n)  was  taken  to  be  a  sine  wave  of  the  form  A  sin  (a>Qn  +  ip)  with  tp 
uniformly  distributed  in  (0,  2tt).  The  results  are 
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(7 
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2  2  2 
A  (1  +  a  j  of 

2(1  —  a2)  (a2  —  2a  eos  +  1 ) 


(11-52) 
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and 


(11-53) 


The  noise-to-signal  ratio  turns  out  to  be  the  same  as  for  the  white  noise  input. 

To  test  the  model,  er^  was  measured  experimentally  for  white  noise  and  sine  wave  inputs. 

The  sine  wave  case  was  included  beeause  it  was  felt  that  our  assumption  of  uncorrelated  round¬ 
off  variables  was  most  in  doubt  for  the  case  of  a  highly  correlated  signal.  To  earry  out  the  meas¬ 
urements,  eaeh  input  was  applied  to  a  filter  using  a  27-bit  mantissa,  and  also  to  a  filter  with  the 
same  eoeffieient  a,  but  using  a  shorter  (e.g.,  12-bit)  mantissa  in  the  computation.  The  out¬ 
puts  of  the  two  filters  were  then  subtracted,  squared,  and  averaged  over  a  sufficiently  long  pe¬ 
riod  to  obtain  a  stable  estimate  of  a  .  Beeause  the  measurement  time  needed  for  a  stable  esti- 

e 

mate  increases  sharply  as  the  pole  of  the  filter  approaches  z  -  1,  0.95  was  the  largest  value  used 
for  a. 

In  order  to  eompare  theory  and  experiment,  it  was  neeessary  to  use  a  numerical  value  for 

2  2 
a  .  Direct  measurements  on  roundoff  error  sequences  showed  that  a  for  a  multiplication  was 

slightly  different  than  for  an  addition,  and  that  varied  somewhat  depending  on  the  signal  used 

and  (for  multiplication)  on  the  eoeffieient  a.  However,  it  was  found  that  rms  output  noise  could 

be  predieted  within  less  than  one  bit,  if  these  slight  differences  were  neglected  and  the  empirical 

"average"  or  ^  of  (11-48)  was  used. 

In  Fig.  5,  experimental  eurves  for  noise-to-signal  ratio  are  compared  with  the  predictions 
of  the  model.  The  same  theoretical  eurve  suffices  for  both  white  and  sinusoidal  inputs,  since 


Fig.  5.  Theoreticol  ond  experimentol  noise-to-signol  rotio  for  flooting  point, 
first-order  filter,  os  o  function  of  pole  position.  Noise-to-signol  rotio  rep¬ 
resented  in  bits. 
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( 11- 51a)  and  (11-53)  are  identical.  The  quantities  actually  displayed  on  the  curve  are  values  of 


a  normalized  rms  noise-to-signal  ratio  represented  in  units  of  bits.  We  note  that  the  agreement 
of  theory  with  experiment  is  generally  quite  good. 


2.  Second-Order  Case 


An  analysis  similar  to  the  above  can  be  carried  out  for  a  second-order  filter  of  the  form 


y(n)  =  r  y(n  —  2)  +  2r  cos  0  y(n  —  1 )  +  x(n) 


(H-54) 


with  a  complex  conjugate  pole  pair  at  z  =  r  e 


When  x(n)  is  stationary  white  noise,  we  obtain  for  the  variance  of  the  noise  e(n), 
2  2  2LJ_2/,  4J,,2  2  _  , ,  r4  cos2  ©\1 

%  =  CTe  ax  lGl  +  G1  l3r  +  12r  COS  6  '  16  1'+  r 2  ~jj 


(11-55) 


where 


1  C  _  r2  ^  C4  +  1  -  4r2  cos20  +  2r2^ 


2  2 

as  in  (II- 2 9b).  Noting  that  a  =  G,a  ,  we  can  express  the  noise-to-signal  ratio  as 

y  ix 
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!1  (3 
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77^)1 


(II- 56a) 


which,  for  high-gain  r  =  1  —  6,  becomes 
2 


%  ^2  /  3  +  4  c  os  4  0  \  . 

-2  e  '  46  sin2  O  ' 


(11-5  6b ) 


These  formulas  also  describe  closely  the  output  noise-to-signal  ratio  for  a  two-pole,  one-zero 
system  as  in  (11-27),  realized  via  the  canonic  form  (11-31),  since  implementation  of  the  zero 
after  the  pole  pair  has  negligible  effect  on  the  noise-to-signal  ratio,  especially  for  the  high-gain 
case.  The  result  for  the  direct  form  (11-28)  is  essentially  the  same,  and  if  the  order  of  opera¬ 
tions  in  (11-28)  is  such  that  x(n)  —  r  cos0  x(n  —  1)  is  computed  before  adding  the  effects  of  the 
past  output  samples,  then  with  the  high-gain  approximation  the  noise-to-signal  ratio  for  the  di¬ 
rect  form  will  be  exactly  the  same  as  (11-5 6b). 

As  in  the  fixed  point  case,  a  coupled  form  realization  (11-42)  of  the  two-pole,  one-zero  net¬ 
work  yields  an  improved  noise-to-signal  ratio  at  low  resonant  frequencies.  Assuming  white  in¬ 
put  and  the  high-gain  approximation,  the  result  for  the  coupled  form  (the  lengthy  derivation  is 
omitted)  is 


CTe  2  / 3  +  4  sin2©\ 
a2  \  46  } 


(11-57) 
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The  factor  l/sin  9  of  (II-56b)  is  not  present  in  (11-57).  We  should  re-emphasize,  however, 
that  implementation  of  the  coupled  form  requires  more  computation  than  the  direct  or  canonic 
forms . 

Let  us  return  attention  to  the  second-order  system  of  (11-54).  In  addition  to  the  analysis  for 
white  noise  input,  the  output  noise  variance  has  been  derived  for  a  sinusoid  A  sinfw^n  +  </>).  The 
result  is 

2  .  2„  2.3  4  i  u|  2  /  2  2  Q  i  „  1 2  1 

cre  =  A  G^cr  [2  r  I  H  |  +  6r  cos  ©  |  H  |  +  -j 


—  4r  ^  |  H  |  ^  cos  0  cos  —  A  |  H  |  cos  (<<  —  2u)q) 

+  2r  |  H  |  cos  ©  cos  (ty  —  co  )]  (11-58 ) 

where  |  H  |  and  'k  represent  the  magnitude  and  phase  of  the  filter  system  function  at  the  input 
frequency  o)Q. 


TABLE  I 

THEORETICAL  AND  EXPERIMENTAL  NOISE-TO-SIGNAL  RATIO 
FOR  FLOATING  POINT,  SECOND-ORDER  FILTER, 

AS  A  FUNCTION  OF  POLE  POSITIONS 
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White  Noise 
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Theoretical 

Experimental 

Theoretical 

Experimental 

0. 55 

22.5 

1.48 

1.66 

1.54 

1.64 

0.7 

22.5 
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2.33 

2.23 

2.38 

0.9 
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3.32 

3.33 

3.35 

3.45 

0.55 

45.0 
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0.97 

0.94 

0.7 

45.0 

1.36 

1.44 

1.37 

1.51 

0.9 

45.0 

2.28 

2.51 

2.22 
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0.55 

67.5 

0.42 

0.46 

0.39 

0.33 

0.7 

67.5 

0.  75 

0.88 

0.65 

0.62 

0.9 

67.5 

1.63 

1.97 

1.45 

0.99 

By  using  both  white  noise  and  sinusoidal  inputs,  experimental  noise  measurements  were  per¬ 
formed  for  the  two-pole  system  (11-54).  In  Table  I,  a  comparison  of  theoretical  and  experimen¬ 
tal  results  for  output  noise-to-signal  ratio  is  displayed;  (11-48)  was  assumed.  The  agreement 
of  theory  with  experiment  is  generally  quite  good. 

E.  ROUNDOFF  NOISE  -  BLOCK  FLOATING  POINT 

26) 

A.  V.  Oppenheinri  has  proposed  a  realization  of  recursive  filters  using  block  floating  point 
arithmetic,  and  analyzed  the  effect  of  roundoff  errors  on  such  a  realization.  Here,  we  review 
some  of  his  work  to  set  the  stage  for  a  comparison  with  fixed  and  floating  point  arithmetic. 
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In  block  floating  point  arithmetic,  the  input  and  filter  states  (i.  e.,  the  inputs  to  the  delay 
registers)  are  jointly  normalized  before  the  multiplications  and  additions  are  performed  with 
fixed  point  arithmetic.  The  scale  factor  (or  exponent)  obtained  during  the  normalization  is  then 
applied  to  the  final  output  to  obtain  a  fixed  point  output. 

A  block  floating  point  realization  of  the  first-order  filter 

y(n)  =  x(n)  +  ay(n  -  1)  (11-59) 

can  be  represented  as 

y(n)  =  ATnj  (  U(n(-l)l  A(n  -  1  >  x<n>  +  a  Ufn  -  l)1  A(n  “  1  >  y<n  "  1 »  (H-60) 

where  A(n)  represents  the  scale  factor  used  to  jointly  normalize  x(n)  and  y(n—  1).  A  first  guess 
at  a  suitable  A(n)  might  be 

A(n)  =  max  [  |  x(n)  |  ,  |  y(n  -  l)fi  '  (I1"6l) 

In  practice,  A(n)  would  probably  be  restricted  to  be  a  power  of  two  so  that  a  left  shift  rather 
than  a  multiplication  could  be  used  in  the  normalization.  Also,  depending  on  the  order  of  the  fil¬ 
ter,  a  somewhat  smaller  value  of  A(n)  than  in  (11-61) 
would  be  needed  to  prevent  overflow;  for  a  first-order 
filter,  halving  the  value  of  A(n)  given  in  (11-61)  would 
prevent  overflow.  Figure  6  is  a  circuit  diagram  rep¬ 
resenting  (11-60).  The  operation  is  as  follows.  The 
input  sample  x(n)  is  multiplied  by  the  previous  scale 
factor  A(n  —  1),  then  this  quantity  A(n  —  1)  x(n)  is 
jointly  normalized  with  the  contents  of  the  delay, 

A(n  —  1),  to  determine  the  incremental  scale  factor 
A(n)  =  A(n)/A(n— 1).  A(n)  y(n)  is  then  computed, 
stored  in  the  delay,  and  multiplied  by  l/A(n)  to  yield  a  fixed  point  output  sample.  We  are  then 
ready  for  a  new  input  sample. 

A  roundoff  noise  source  is  introduced  in  the  multiplication  of  A(n  —  1)  y(n  —  1)  by  A(n),  the 

multiplication  by  a,  and  the  multiplication  by  l/A(n).  Assuming  that  all  noise  sources  have  zero 

2 

mean  and  equal  variance  a ^  and  are  independent  of  A(n),  the  output  noise  is 


t/A  ( n) 


A(n-t)  A(n) 
x(n)  o  —  ► 


»<n) 


A(n) 


Fig.  6.  Block  diagram  representing  block 
floating  point,  first-order  filter. 


,  ,  1  +  a‘  f,  1  ,2 

1  +  — IT  6[AhT)] 

l  —  a 


(11-62) 


where  6  denotes  expected  value.  Later,  for  comparison  with  fixed  and  floating  point,  we  shall 
assume  a £  =  (l/l2)2-  ,  as  in  fixed  point. 

To  compare  this  result  with  fixed  and  floating  point,  we  need  a  formula  for  output  noise-to- 
signal  ratio.  The  block  floating  point  filter  yields  a  fixed  point  output,  and  this  output  must  fit 
within  a  register.  There,  the  constraint  on  input  signal  is  as  described  in  (11-22),  and  it  is  rea¬ 
sonable  to  assume  a  white  input  with  variance  as  given  in  (11-24).  The  result  for  output  noise- 
to-signal  ratio  is 


a 


2 

e 


3(1  -  a2)  3(1  +  a2)  «_  2 

<l-a)2  (1_a)2  fc[A(n)1 


If  we  assume  the  high-gain  case  where,  with 


(11-6  3a) 
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.  ,  _ L_2 _  1 

(n  max  [  |  x(n)  | ,  |  y(n  -  1)|  |  ~  2|y(n)| 


2  2 

so  that  6[l/A(n)]  =  4o-  ,  then,  with  a  =  1—6,  the  result  is 


e  10  2 

2  “  6  ^ 


(II-63b) 


Similarly,  a  block  floating  point  realization  can  be  defined  and  a  noise  analysis  carried  out 
for  the  second-order  filter 


y(n)  =  x(n)  +  2r  cos  0  y(n  —  1 )  —  r  y(n  —  2) 
The  output  noise-to-signal  ratio  (for  white  input)  is 

'  2 

-4-»  V  r11  |  s  in  (n  +  1 )  ©  | 
sm8  1  1 


2  ,2 
a  3a, 
e  _ £_ 

7^  ^7 

y 


n=0 


2  lA(n)' 

ay 


(2  +  2r4  +  4r2  cos20) 


(11-64) 


which  for  the  high-gain  case  and  small  0  can  be  approximately  bracketed  according  to 
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I  |3  + 


■  2  „ 
sin  0 


2  %. 

<  2 
(T 


12  + 
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•  2  „ 
sin  0 


(11-65) 


The  results  of  noise  measurements  performed  by  Oppenheim  on  block  floating  point  filters 
have  corresponded  well  with  the  formulas  just  presented. 


F.  FIXED  VS  FLOATING  VS  BLOCK  FLOATING  POINT 


Having  obtained  experimentally  verifiable  noise-to-signal  ratios  for  fixed,  floating,  and 
block  floating  point  recursive  filters,  we  are  now  in  a  position  to  make  a  comparison.  We  should 
emphasize  that  our  comparison  is  on  a  very  restricted  basis.  First,  many  factors  other  than 
noise-to-signal  ratio  —  such  as  computing  facilities  available,  hardware  requirements,  ease  of 
programming  —  enter  into  a  choice  of  type  of  arithmetic.  Second,  we  shall  be  comparing  noise- 
to-signal  ratios  only  for  white  input,  and  the  results  for  other  types  of  signals  will,  in  general, 
be  different.  For  example,  cr  /cr  for  a  high-gain,  first-order,  fixed  point  filter,  with  low- 

^  y  2 

frequency  (rather  than  white)  input  will  tend  to  be  proportional  to  1/6,  rather  than  to  l/6  [com¬ 
pare  (II- 26b)  and  ( II - 26d ) ] .  If  we  assume  low-frequency  input  for  the  floating  point  case,  the 
l/6  dependence  in  (II— 51b)  will  still  hold.  Thus,  at  least  for  this  example,  the  fixed  point  case 
compares  more  favorably  to  the  floating  point  case  if  an  in-band,  rather  than  white,  signal  is 
assumed.  We  proceed  with  our  white  signal  based  comparison,  with  the  reservation  that  one 
must  be  careful  in  deciding  whether  the  comparison  is  valid  for  his  particular  application. 

For  a  first-order  filter,  and  white  input,  let  us  summarize  the  results  derived  above  for 
output  noise-to-signal  ratio: 


2 


fixed  point 


1  2-2t 

4(1  -  a)2 


(11-66) 
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(11-67) 


I) 

y  /floating  point 


(0.23)  (1  +  a  )  2-2t 

(1 


2. 

a  ) 


2\ 

<r  \ 

-l) 

ay /block  floating 


+  (1  +  a  L  k  2-2t 

4(1  ~a)  4(1  -a)2 


where 

k  =  6{[1/A(n)]2} 

and 


(11-68) 


( II  —  6  9a ) 


A(n) 


_ 1/2  _ 

max  [  |x(n)| ,  |y(n  —1)1] 


( II- 6  9b ) 


2 

The  quantity  k  in  (11-68)  is  signal  dependent.  For  high  gain,  we  would  expect  k  *  4 <ry . 
heim  has  measured  experimentally  the  ratio 


Oppen- 


k'  = 


k/4ay2 


(11-70) 


for  block  floating  point  filters  with  the  coefficient  ranging  from  a  0.1  to  0.95.  The  values  of 
k'  were  all  between  2  and  3.  Rewriting  (11-68)  in  terms  of  the  quantity  k',  wc  have 


2  J 

°y /block  floating 


f  (1  +  a) 

[4(1  -  a) 


+ 


] 

-  a  )  J 


2t 


(11-71) 


Figure  7  compares  curves  of  output  noisc-to-signal  ratio  as  given  by  (11-66),  (11-67),  and 

2  —  2t  2 

(11-71).  Actually,  we  plot  in  each  case  the  quantity  |  log?(cr  /2  a  ),  representing  an  rms 

c  c  y 

output  noise- to-signal  ratio,  in  units  of  bits.  For  the  block  floating  point  case,  Oppenheim's 
experimental  values  of  k'  were  used  up  to  a  =  0.95;  for  a  >  0.99,  the  high-gain  approximation 
k'  =  1  was  used.  From  Fig.  7,  we  observe  that  floating  point  arithmetic  leads  to  the  smallest 
noise-to-signal  ratio,  block  floating  point  the  next  smallest,  and  fixed  point  the  largest.  Wc 
also  notice  that  for  high-gain  filters,  as  a  increases  toward  unity,  the  noise-to-signal  ratio  for 
fixed  point  increases  faster  than  for  floating  or  block  floating  point. 

This  comparison,  however,  is  based  on  the  assumption  that  the  number  of  bits  in  the  man¬ 
tissa  is  the  same  for  the  three  types  of  arithmetic.  The  comparison  does  not  account  for  the 
additional  bits  needed  for  the  characteristic  in  either  block  floating  or  floating  point  arithmetic. 

In  floating  point,  all  quantities  are  represented  with  a  mantissa  and  characteristic.  If  c 
denotes  the  number  of  bits  in  the  characteristic,  this  could  be  accounted  for  in  Fig.  7  by  numer¬ 
ically  adding  the  constant  c  to  the  floating  point  data.  This  shift  will  cause  the  floating  and  fixed 
point  curves  to  cross  at  a  point  where  the  noise-to-signal  ratios  are  equal  for  equal  total  reg¬ 
ister  lengths.  To  compare  fixed  and  floating  point  on  this  basis,  we  provide  just  enough  bits  in 
the  characteristic  to  allow  the  same  dynamic  range  for  both  the  fixed  and  floating  point  filters. 

If  tj  denotes  the  length  of  the  fixed  point  fraction,  then  the  requirement  of  identical  dynamic 
range  implies  that 


c  =  1°S2tfx 


(11-72) 
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Fig.  7.  Comparison  of  fixed,  floating,  ond  block  floating  point  noise-to-signol 
ratios  —  first-order  filter. 


By  assuming,  for  example,  that  t^x  =  16  so  that  c  =  4,  a  crossover  point  in  noise-to-signal  ratio 
will  occur  at  a  =  0.996. 

For  block  floating  point,  one  exponent  is  used  for  both  x(n)  and  y(n  —  1),  so  it  is  not  as 
straightforward  to  compare  fixed  and  block  floating  point  on  the  basis  of  equal  register  lengths. 
However,  the  issues  in  such  a  comparison  should  be  clear. 

Now  let  us  consider  the  second-order  case,  specifically  a  filter  of  the  form  (11-54)  with  a 
complex  conjugate  pole  pair.  For  white  input,  our  results  for  output  noise-to-signal  ratio  can 
be  summarized  as: 
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where  now 
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(11-74) 
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k  =  6{[1/A(n)]2} 


A(n) 
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max  (|x(n)|,  |y(n  -  1)|,  |y(n  -  2)|  ] 
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Fig.  8.  Comporison  of  fixed  and  floating  point  noise-to-signal  ratios  —  second-order  filter. 


We  do  not  have  available  measured  values  of  k  for  the  second-order  case,  so  in  Fig.  8  we 

display  the  results  only  for  fixed  and  floating  point/  0  has  been  fixed  at  20°,  and  r  is  varied. 

The  comparison  between  fixed  and  floating  point  is  similar  to  the  first-order  case  depicted  in 

Fig.  7.  For  tfx  =  16,  a  crossover  point  in  noise-to-signal  ratio  would  occur  at  r  =  0.998,  0  =  20°. 

Let  us  re-emphasize  that  the  fixed  vs  floating  point  comparisons  of  Figs.  7  and  8  are  valid 

only  for  white  input.  As  we  illustrated  at  the  end  of  Sec.  II-C-2,  a  2/<j  2),.  ,  .  .  is  quite  de- 

e  y/iixed  point  n 

pendent  on  the  signal  spectrum  so  that  assuming  different  signal  statistics,  we  might  obtain  a 

very  different  comparison.  In  Sec.  II-D-i,  we  found  that  <72/a2V.  ..  .  .  was  the  same  for 

e'  y /floating  point 

white  and  sinusoidal  inputs,  indicating  that  the  floating  point  noise-to-signal  ratio  is  relatively 
insensitive  to  changes  in  the  signal  spectrum. 


t  The  relationship  between  the  fixed  ond  floating  point  curves  in  Fig.  8  is  different  than  in  Fig.  2b  of  Weinstein 
and  OppenheimpO  becouse  on  error  wos  mode  in  the  earlier  paper,  where  all  points  on  the  fixed  point  curve 
should  be  1.57  bits  higher  than  the  values  displayed.  With  this  correction,  the  earlier  figure  will  be  seen  to 

correspond  to  our  Fig.  8,  if  we  note  that  the  quantity  plotted  in  the  earlier  paper  wos  1/2  log 

1/2  logj  [cr^/(0.  23)2  insteod  of  1/2  \o^^{a^/2  as  in  Fig.  8. 
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m.  THE  FAST  FOURIER  TRANSFORM  (FFT) 


tn  this  section,  we  study  the  effects  of  quantization  on  implementation  of  the  FFT  algorithm. 
Errors  due  to  coefficient  quantization  and  to  rounding  in  the  computation  will  each  be  consid¬ 
ered.  Schemes  for  scaling  the  array  so  that  no  overflows  occur  will  be  discussed.  The  differ¬ 
ing  quantization  effects  in  fixed,  floating,  and  block  floating  point  arithmetic  will  be  examined 
and  compared. 

There  are  many  forms  of  the  FFT  algorithm,  and  the  detailed  effects  of  quantization  will 
differ  depending  on  the  form  used.  Our  specific  results  will  be  obtained  for  the  case  of  a  radix  2 
algorithm.  Much  of  the  theory  will  be  valid  for  any  of  the  standard  radix  2  algorithms,  but  all 
experiments  have  been  carried  out  using  a  decimation  in  time  algorithm.  ’  We  feel  that  most 
of  the  ideas  employed  in  the  error  analysis  of  simple  radix  2  algorithms  can  be  utilized  in  error 
analysis  of  other  forms  of  the  algorithm  (mixed  radix,  etc.). 

Our  approach  in  analyzing  noise  in  the  FFT  is  basically  statistical.  In  most  cases,  we  sup¬ 
port  the  predictions  of  our  models  with  experimental  data.  For  floating  and  block  floating  point 
arithmetic,  in  order  to  simplify  analysis  and  obtain  concrete  results,  we  find  it  convenient  to 
assume  a  simple,  white  model  for  the  signal  being  transformed.  Discussion  of  how  the  results 
might  be  expected  to  change  for  other  types  of  signals  is  included,  as  are  experimental  noise 
measurements  on  FFT's  of  nonwhite  signals. 


A.  COEFFICIENT  QUANTIZATION 

In  computing  a  discrete  Fourier  transform  (DFT)  via  the  FFT,  the  system  coefficients  used 
are  values  of 


wk  =  e(-j27r/N)k 


cos 


2  irk 
N 


sin 


Zirk 

N 


(III— 1) 


for  k  =  0,  1,  .  .  .  ,  N  —  1,  where  N  is  the  number  of  points  transformed.  This  implies  a  require¬ 
ment  of  2N  real  coefficients,  but  symmetries  in  the  trigonometric  functions  can  be  used  to  re¬ 
duce  the  number  needed.  These  coefficients  may  be  obtained  in  various  ways,  for  example  via 
table  look  up,  or  from  the  iteration  of  a  difference  equation  with  poles  on  the  unit  circle.  But 
in  whatever  manner  the  coefficients  are  obtained,  they  can,  because  of  the  finite  word  length, 
only  approximate  the  true  values  of  the  functions. 

Our  aim  here  is  to  estimate  the  error  in  a  computed  DFT,  due  to  the  errors  in  these  sys¬ 
tem  coefficients.  To  separate  the  errors  due  to  coefficient  quantization  and  roundoff  noise,  we 
shall  assume  in  this  discussion  that  all  arithmetic  is  performed  with  perfect  precision,  though 
the  coefficients  are  quantized.  For  the  purpose  of  analysis,  we  shall  assume  that  all  coefficients 
in  ( III  - 1 )  are  obtained  from  a  stored  table  which  is  accurate  to  the  full  computer  word  length  ex¬ 
cept  for  a  roundoff  in  the  last  bit.  Our  initial  discussion  will  pertain  to  fixed  point  arithmetic, 

but  we  shall  show  how,  with  a  slight  modification,  the  analysis  can  be  applied  also  to  the  floating 

u 

point  case.  We  shall  deal  with  the  case  of  a  radix  2  FFT  algorithm,  where  N  =  2  ,  and  the  DFT 
is  computed  in  v  stages. 

To  set  up  the  problem,  we  first  write  the  definition  of  the  DFT  of  an  N-point  sequence, 

N-i 

X(k)  =  £  x(n)  Wnk  k  -  0,  1, - N-l  .  (III-2) 

n=0 
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nk 

This  can  be  viewed  as  a  multiplication  of  the  N-vector  x(n)  by  the  N  x  N  matrix  [W  ].  Instead 
of  (III-2),  an  FFT  computation  using  rounded  coefficients  would  yield 

N-l 

X(k)  -  V  x(n)J2nk  (III -3) 

n-0 

nk 

where  the  matrix  [S2nk]  is  different  from  [W  j.  For  a  radix  2  algorithm,  each  element  SI  nk  will 
be  a  product  of  v  factors  obtained  from  our  sine-cosine  table.  Thus, 


nnk  =  n  (Wai  +  6.) 

i=  1 


(III  —4a. ) 


where 


[]  Wai  =  Wnk  (III —4b ) 

i=  1 

and  6.  represents  a  rounding  error  in  one  of  our  tabled  coefficients.  It  is  to  be  understood  in 
(II 1—4 )  that  the  u  values  of  Wai  and  6.  are,  in  general,  different  for  each  SI  .  The  complex 
rounding  error  6^  satisfies  the  bound 


|6il<'/22't  (II I-S) 

and,  if  assumed  to  obey  our  usual  statistical  assumption  on  roundoff  errors,  would  have  zero 
mean  and  variance 


ft  |  <5.  |  ^  =  (2)2'2t/l2  =  -g-  . 

The  error  E(k)  in  a  computed  DFT  value  can  be  expressed  as 

N-l  N-l 

E(k)  =  X(k)  —  X(k)  =  £  x(n)  (SInk  -  Wnk)  =  £  x(n)  6SI  nk 
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(III  -7 ) 


n=0 
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so  that  the  error  E(k)  is  the  result  of  multiplication  of  the  input  sequence  x(n)  by  the  "error  ma- 
trix"  [Mis¬ 
using  (III  — 4),  ( I II  —  5 ),  and  (III— 7 ),  we  can  derive  a  deterministic  bound  on  the  elements  of  the 

error  matrix,  and  from  this  obtain  a  deterministic  bound  on  the  ratio  of  mean-squared  output 

2  -2t 

error  to  mean-squared  output  signal.  However,  the  bound  obtained  is  proportional  to  Nu  2 
which  as  we  shall  see  below  is  quite  pessimistic  compared  with  results  which  normally  occur. 

Using  instead  a  random  model  for  the  error  matrix,  we  can  estimate  statistically  the  ex¬ 
pected  output  noise-to-signal  ratio.  The  predictions  of  the  model  can  then  be  checked  against 
experimental  error  measurements.  This  approach  is  now  discussed. 

From  (III-4)  and  (III— 7 ) ,  we  can  obtain  an  expression  for  an  element  6SInk  of  the  error  matrix 


<5  SI 
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=  W3;  +  higher  order  terms 
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( III— 8 ) 
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where  the  higher  order  terms  involve  products  of  rounding  errors,  and  are  thus  of  the  order 
-2t 

2  .  Assuming  t  sufficiently  large  so  that  these  higher  order  terms  can  be  neglected,  observ¬ 
ing  that  |  W |  1,  assuming  the  <5.  mutually  uncorrelated,  and  using  ( III -6 ) ,  we  obtain 


fe|  an 
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(111-9) 


Making  the  further  assumption  that  all  elements  II  ^  are  uncorrelated  with  each  other  and  with 
the  signal  x(n),  we  obtain 


t;|E(k)|2  =  cr2 
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or  by  Parseval's  relation 
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This  is  the  desired  result  for  the  ratio  of  mean-squared  output  error  to  mean-squared  output 
signal. 

Some  of  the  assumptions  used  in  obtaining  (111-11)  are  clearly  in  doubt.  For  example,  each 
of  the  coefficients  stored  in  our  table  is  used  a  number  of  times  in  the  FFT  computation,  and,  of 
course,  the  roundoff  error  in  this  coefficient  is  the  same  each  time  it  is  used.  Thus,  the  as¬ 
sumptions  of  mutually  uncorrelated  <5.  and  mutually  uncorrelated  are  quite  questionable. 

Further,  we  have  assumed  that  all  transmissions  in  our  FFT  flow  chart  are  along  paths  where 
multiplication  by  an  inaccurate  coefficient  is  required,  whereas  in  many  cases  only  an  addition 
or  subtraction,  and  thus  no  multiplication  by  an  inaccurate  coefficient,  is  required.  For  instance, 
in  a  standard  decimation  in  time  algorithm,  the  matrix  element  fi  qq  would  be  identically  equal  to 
1,  with  no  error.  This  might  lead  us  to  believe  that  our  estimate  (1I1-H)  for  error-to-signal 
ratio  might  be  slightly  high. 

Although  we  might  not  expect  (111-11)  to  predict  with  great  accuracy  the  error  in  an  FFT  due 
to  coefficient  quantization,  it  would  be  helpful  even  as  a  rough  estimate  of  the  error.  The  key 
result  of  (III-ll),  which  we  would  like  to  test  experimentally,  is  that  the  error-to-signal  ratio 
increases  very  mildly  with  N,  being  proportional  to  u  -  log^N,  so  that  doubling  N  produces  a 
very  slight  increase  in  error-to-signal  ratio.  When  we  consider  later  the  output  noise  due  to 
roundoff,  we  shall  find  (for  the  fixed  point  case)  a  much  sharper  increase  with  N  of  the  output 
noise-to-signal  ratio. 

To  test  this  result,  experimental  measurements  on  errors  due  to  coefficient  quantization 
were  made’.  In  each  run,  a  sequence  x(n)  —  white  in  the  sense  that  all  2N  real  numbers  making 
up  the  N-point  complex  sequence  were  mutually  uncorrelated,  with  zero  means,  and  equal  var¬ 
iances  —  was  obtained  using  a  random  number  generator.  This  sequence  was  transformed  twice, 
once  using  a  36-bit  coefficient  table,  and  once  using  a  coefficient  table  rounded  to  much  shorter 
word  length  (e.g.,  12  bits).  For  each  transform,  36-bit  accuracy  was  used  in  the  arithmetic  to 
make  the  effect  of  roundoff  error  negligible.  The  results  were  subtracted,  squared,  and  divided 
by  the  output  signal  variance  (N  times  the  input  signal  variance)  to  obtain  an  experimental  output 
error-to-signal  ratio.  For  each  value  of  N,  several  random  sequences  were  transformed  and 
the  results  were  averaged  to  obtain  statistically  convergent  estimates. 
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Fig.  9.  Errors  due  to  coefficient 
quantization  in  FFT. 


These  results  are  displayed  in  Fig.  9;  the  quantity  plotted  is  a^/2  2V-^,  where  a 2  - 
6|X(k)|2.  The  theoretical  curve 
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corresponding  to  (111-11),  is  shown,  and  the  circles  represent  measured  output  error-to-signal 
ratio  for  the  fixed  point  case.  We  note  that  the  experimental  results  generally  lie  below  the 

theoretical  curve.  No  experimental  result  differs  by  as  much  as  a  factor  of  two  from  the  the- 

2  2 

oretical  result,  and  since  a  factor  of  two  in  corresponds  to  only  half-a-bit  difference  in 

the  rms  output  error,  it  seems  that  (III-12)  is  a  reasonably  accurate  estimate  of  the  effect  of 
coefficient  errors.  The  experimental  results  do  seem  to  increase  essentially  linearly  with  u, 
but  with  smaller  slope  than  given  in  (111-12). 

In  the  above,  fixed  point  arithmetic  has  been  assumed.  However,  since  a  block  floating 
point  FFT  will  generally  use  fixed  point  coefficients,  our  results  are  valid  for  the  block  floating 
point  case  also.  With  some  slight  modifications,  we  can  obtain  similar  results  for  the  floating 
point  case. 

In  fact,  the  argument  leading  to  (111-11)  applies  directly  to  the  floating  point  case,  except 
that  for  floating  point  we  must  consider  that  the  expected  magnitude  of  the  rounding  error  in  a 
tabled  coefficient  is  proportional  to  the  magnitude  of  the  coefficient.  Thus,  the  tabled  coefficient 
approximating  W  will  have  the  rounded  value 

cos  Hr  ^  +  e> -  j  sin< (■* +  * > 
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If  we  assume  random  roundoff  variables  e  and  £,  with  variance  <7  =  <7.  ,  then  the  roundoff  error 

€  £ 

<s  „  27rk  .  2rk 

6  =  e  cos  — ^ - sin  - 
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has  variance 
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With  (111-13)  replacing  (III— 6),  the  remainder  of  the  argument  leading  to  (III  — 1 1)  and  ( III  - 1 2 )  fol¬ 
lows,  and  we  have  for  the  floating  point  case 
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Except  for  perhaps  a  constant  factor,  the  floating  and  fixed  point  results  are  the  same,  since 
2  -  2t 

<jf  is  proportional  to  2  .  Experimental  results  for  the  floating  point  case  are  represented  by 

the  solid  dots  in  Pig.  9,  and  are  observed  to  be  slightly  lower  than  the  results  for  the  fixed  point 

case.^  As  a  partial  accounting  for  this  difference,  we  can  mention  that  in  the  floating  point  pro- 

gram,  the  frequently  used  coefficients  W  -  1  and  — j  were  represented  exactly,  whereas  this 

was  not  the  case  for  our  fixed  point  program.  However,  the  discrepancies  between  the  results 

2  2 

are  very  slight  from  a  practical  viewpoint,  since  a  factor  of  two  difference  in  °'g/cr^  represents 
only  a  half-a-bit  difference  in  rms  output  error. 

The  above  work  was  designed  only  to  give  a  rough  estimate  of  the  errors  to  be  expected,  on 
the  average,  due  to  coefficient  quantization  in  the  FFT.  Many  phenomena,  which  occur  in  spe¬ 
cial  cases,  cannot  be  explained  by  our  statistical  model.  For  example,  it  has  been  observed 
(for  our  decimation  in  time  algorithm,  with  tabled  coefficients)  that  if  x(n)  is  a  single  pulse  at 
n  =  0,  X(k)  will  be  a  constant,  despite  coefficient  quantization.  Similarly,  a  constant  x(n)  yields 
a  single  pulse  at  k  =  0  for  X(k).  It  has  been  found  that  a  single  frequency  input  x(n)  e ^ J 2 zrln/IX ) 
will  yield,  because  of  coefficient  quantization,  spurious  peaks  at  multiples  of  the  input  frequency. 
Also,  it  has  been  observed  that  a  real  input  x(n)  yields  a  transform  whose  real  part  is  even  and 
whose  imaginary  part  is  odd,  despite  the  coefficient  quantization.  Each  of  these  phenomena  can 
be  explained  by  considering  special  symmetries  and  other  details  of  the  actual  FFT  computation. 
However,  the  explanations  are  each  valid  only  for  a  specific  input  and  a  specific  algorithm,  and 
are  not  particularly  enlightening  from  a  general  viewpoint. 

Some  comments  can  be  made  about  the  case  where  coefficients  are  obtained  from  a  recur¬ 
sive  difference  equation,  rather  than  from  a  table  look-up.  First,  one  would  probably  be  wise 

to  use  a  coupled  form,  rather  than  a  direct  form  digital  oscillator,  since  the  former  yields  much 

2 1 

greater  accuracy  in  the  sine  and  cosine  values.  Even  using  the  coupled  form,  one  should  fre¬ 
quently  reset  the  initial  conditions  of  the  oscillator.  As  the  oscillator  is  allowed  to  run  for  longer 
periods,  the  sine  and  cosine  values  obtained  tend  to  become  more  in  error  from  the  desired  val¬ 
ues,  both  because  of  roundoff  error  in  computing  the  recursion  and  (perhaps  more  importantly) 
because  the  coefficients  of  the  recursion  will  be  slightly  in  error,  yielding  an  error  in  output 
frequency.  This  effect  has  been  observed  for  a  2048-point  transform,  in  which  the  digital  oscil¬ 
lator  was  reset  only  at  the  beginning  of  each  stage  of  the  FFT.  By  the  last  stage,  when  the 


t  In  comparing  the  results  as  platted  an  the  some  groph,  it  is  implicitly  ossumed  thot  the  number  of  bits  in  the 
flooting  paint  mantisso  is  equal  to  the  number  of  bits  in  the  fixed  point  fraction.  Actuolly,  the  need  for  on  ex¬ 
ponent  then  implies  greoter  totol  word  length  for  floating  point. 
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recursion  was  run  for  1024  points,  some  errors  in  the  generated  coefficients  were  large  enough 
to  affect  the  last  10  bits  of  these  coefficients.  Significant  distortions  in  computed  transforms 
were  observed  to  result  from  these  errors.  However,  more  frequent  resetting  of  the  initial  con¬ 
ditions  of  the  coefficient -producing  recursions  would  significantly  reduce  this  error. 

B.  ROUNDOFF  NOISE  AND  OVERFLOW  CONSTRAINT  -  FIXED  POINT 

In  this  section,  we  analyze  the  effects  of  roundoff  errors  and  the  overflow  constraint  to 
obtain  formulas  for  output  noise-to-signal  ratio  in  a  fixed  point  FFT.  At  first,  we  ignore  the 
overflow  constraint  and  use  the  general  approach  previously  applied  to  recursive  filters  (see 
Sec.  II-C-1)  to  obtain  formulas  for  output  noise  variance.  Then,  we  consider  the  overflow  con¬ 
straint  and  derive  output  noise-to-signal  ratio  formulas.  We  find  that  if  overflow  is  prevented 
by  attenuating  the  array  at  each  stage  of  the  FFT,  rather  than  simply  by  one  large  attenuation 
of  the  initial  input  array,  the  output  noise-to-signal  ratio  is  improved.  Experimental  verifica¬ 
tion  of  the  predictions  of  the  noise  model  are  presented.  The  results  in  this  section  are  essen- 

27 

tially  the  same  as  the  corresponding  results  reported  by  Welch,  but  the  route  taken  in  the  anal¬ 
ysis  is  somewhat  different. 

First,  we  recall  our  approach  to  analysis  of  fixed  point  roundoff  noise.  On  a  signal  flow 
graph  of  the  system,  we  insert  additive,  signal-independent,  white  noise  sources  after  each  mul¬ 
tiplier.  To  determine  the  effect  of  any  individual  noise  source,  we  find  (via  inspection  of  the 
flow  graph)  a  transfer  function  between  the  noise  source  and  the  system  output,  and  use  linear 
system  noise  theory  to  derive  the  corresponding  output  noise  statistics.  Then,  applying  our 
assumption  that  all  noise  sources  are  independent  of  each  other,  we  determine  overall  output 
noise  statistics  by  simply  adding  the  effects  of  all  the  individual  noise  sources.  In  the  simple 
recursive  filters  studied  earlier,  only  a  few  noise  sources  needed  to  be  considered,  whereas 
in  the  FFT  we  have  many  noise  sources  with  which  to  deal.  An  added  complication  is  that,  since 
the  computation  leading  to  each  FFT  output  point  is  different,  it  might  seem  necessary  to  per¬ 
form  a  separate  noise  analysis  for  each  output  point. 
However,  we  shall  see  that  the  properties  of  this  remark¬ 
able  algorithm  (the  FFT)  are  such  that  these  complica¬ 
tions  can  be  dealt  with  neatly,  and  simple  results  can  be 
obtained. 

Figure  10  displays  an  FFT  flow  graph  for  N  =  8.  A 
specific  decimation  in  time  algorithm  is  depicted.  (An 
implementation  of  this  particular  form  of  the  algorithm 
was  used  for  all  our  experimental  work. )  Some  key  as¬ 
pects  of  this  diagram,  which  are  common  to  all  standard 
radix  2  algorithms,  are  as  follows.  The  DFT  is  computed 
in  u  =  log2N  stages.  At  each  stage,  the  algorithm  passes 
through  the  entire  array  of  N  complex  numbers,  two  at  a 
time,  generating  a  new  N  number  array.  The  u  ^  array 
contains  the  desired  DFT.  The  basic  numerical  compu¬ 
tation  operates  on  a  pair  of  numbers  in  the  m^1  array,  to 

s  t 

generate  a  pair  of  numbers  in  the  (m  +  1 )  array.  This 
computation,  referred  to  as  a  "butterfly,"  is  defined  by 
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Fig.  10.  FFT  flow  graph,  N  =  8. 
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(Ill— 1  5a) 


X  .  (i)  =  X  (i)  +  WX  (j) 
m  +  1  m  m  J 

X  ,.(j)  =  X  (i)-WX  (j)  .  (Ill- 1 5b) 

m  +  1  J  m  m  J 

th  ^ 

Here,  X  (l)  and  Xm(j)  represent  a  pair  of  numbers  in  the  m  array,  and  W  is  some  appropri¬ 
ate  integer  power  of  W,  that  is, 

W  =  Wp  =  e”^,rp^N  .  (111-16) 

At  each  stage,  N/2  separate  butterfly  computations  such  as  (III-15)  are  carried  out  to  produce 
the  next  array.  The  integer  p  varies  with  i,  j,  and  m  in  a  complicated  way,  which  depends  on 
the  specific  form  of  the  FFT  algorithm  that  is  used.  Fortunately,  our  analysis  is  not  tied  to  the 
specific  way  in  which  p  varies.  Also,  the  specific  relationship  between  i,  j,  and  m,  which 
determines  how  we  index  through  the  mth  array,  is  not  important  for  our  analysis. 

Given  an  FFT  flow  graph,  we  can  model  the  roundoff  noise  by  associating  an  independent 
white  noise  generator  with  each  multiplier.  This  means  that  a  noise  source  feeds  into  each  node 
of  the  signal  flow  graph  of  Fig.  10  (excluding  the  initial  array  of  nodes,  since  we  are  not  consid¬ 
ering  A-D  noise  here).  Since  we  are  dealing  with  complex  multiplications,  these  elemental 

2 

noise  sources  are  complex.  Defining  the  complex  variance  as  the  expected  squared  magni¬ 
tude  of  such  a  noise  source,  we  have 
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where  we  have  assumed  that  each  of  the  four  real  multiplications  used  to  perform  the  complex 
multiplication  is  rounded  separately.  In  Fig.  10,  3  x  8  =  24  such  noise  sources  must  be  inserted. 

Our  next  objective  is  to  add  up  the  effects  of  all  these 
noise  sources  to  obtain  the  output  noise  variance.  The 
key  observation  which  simplifies  this  is  that  the  trans¬ 
mission  function  from  any  node  in  the  flow  graph  to  any 
other  connected  node  is  simply  a  multiplication  by  a  com¬ 
plex  constant  of  unity  magnitude  (i.e.,  a  power  of  W). 

Thus,  the  noise  variance  at  any  node  in  the  output  array, 

due  to  a  noise  source  at  any  particular  node  in  the  flow 

2 

chart,  will  be  equal  to  the  noise  variance  if  the  two 

nodes  are  connected,  or  equal  to  zero  if  the  noise  source 

node  is  not  connected  tothis  particular  output  node.  Since 

we  are  assuming  all  noise  sources  uncorrelated,  we  can 

compute  the  noise  variance  at  any  output  node  by  simply 

counting  the  number  of  noise  sources  that  connect  to  this 

2 

node,  and  multiplying  this  number  by  cr^. 

This  count  can  be  made  by  inspection  of  the  flow 
chart,  as  illustrated  in  Fig.  11.  The  output  point  X(2) 
is  singled  out,  and  the  heavy  solid  lines  define  a  "tree" 
whose  branches  double  in  number  as  we  proceed  backward 
from  array  to  array  in  the  flow  chart.  At  the  nodes  of 
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Fig.  11.  FFT  flow  graph,  N  =  8.  Solid 
lines  (tree)  indicate  those  noise  sources 
which  affect  a  particular  output  point. 
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this  tree  are  the  noise  sources  that  connect  to  this  particular  output  point.  The  tree  has 
1  +  2  +  4  =  7  nodes 

so  the  variance  of  the  output  noise  E(2),  at  this  particular  output  point,  is 
6|E(2)|2  _  a2-  7a2  . 

We  can  easily  verify  that  a  similar  tree  drawn  backward  from  any  output  point  will  have  the 
same  number  of  nodes,  so  that 

(7p  =  (1  +2  +4)  <7^  =  7(7j2 

is  the  noise  variance  for  any  output  point.  This  argument  can  be  easily  extended  to  FFT's  for 
larger  numbers  of  points,  since  our  tree  simply  doubles  its  number  of  branches  for  each  addi¬ 
tional  stage.  Thus,  for  N  =  2U,  we  have 

a,2,  =  (1  +  2  +  4  +.  .  .  +  N/2)  <7  2 

or 

a2  =  (N  -  1)  a2 

which,  for  reasonably  large  N,  we  shall  take  as 

a 2  =  NffJ  .  (111-48) 

This  is  the  basic  result  we  have  been  seeking  for  output  noise  variance  in  the  FFT.  It  says 
that  the  variance  of  the  output  noise  is  independent  of  position  in  the  output  array,  and  is  propor¬ 
tional  to  N,  the  number  of  points  transformed.  The  effect  of  doubling  N,  or  adding  another 
stage  in  the  FFT,  is  to  double  the  output  noise  variance.  Using  the  assumptions  we  have  made 
thus  far  about  the  noise  generators  in  the  FFT  (all  uncorrelated,  with  equal  variances),  we  can 
deduce  a  further  result  —  namely,  that  the  output  noise  is  white,  i.e.,  the  N  noise  samples  E(k) 
are  mutually  uncorrelated,  with  independent  real  and  imaginary  parts.  One  can  prove  this  easily 
by  first  being  convinced  that  the  output  of  any  butterfly  is  white  (two  outputs  uncorrelated  with 
equal  variance,  real  and  imaginary  parts  uncorrelated)  if  the  input  is  white.  Since  the  noise 
sources  in  our  system  are  white,  and  all  connected  to  the  output  via  some  combination  of  butter¬ 
fly  computations,  the  output  noise  must  also  be  white. 

In  order  to  simplify  the  analysis  leading  to  (111-18),  we  have  neglected  some  details.  First, 
we  have  associated  equal  variance  noise  sources  with  all  multipliers,  including  where  W  =  land  j. 
In  many  programmed  FFT's  these  multiplications  are  performed  noiselessly.  If  we  assume  in 
the  analysis  that  these  multiplications  are  noiseless,  the  output  noise  variance  will  no  longer 
be  uniform  over  the  output  array.  (For  example,  the  zeroth  output  point  would  be  noiseless.) 

The  average  variance  over  the  output  array  will  be  somewhat  lower  than  the  result  in  (III  — 18), 
but  will  retain  a  linear  dependence  on  N.  Second,  the  assumption  that  all  noise  sources  are  un¬ 
correlated  is  contradicted  by  the  fact  that  the  two  noise  sources  associated  with  a  given  butterfly 
are  negatives  of  each  other,  and  therefore  completely  correlated.  This  does  not  affect  the  re¬ 
sult  for  output  noise  variance,  since  the  two  outputs  of  a  butterfly  connect  to  a  disjoint  set  of 
output  points.  However,  it  implies  that  the  output  noise  samples  E(k)  are  somewhat  correlated. 
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We  feel  that  these  details  are  worth  mentioning,  but  not  worth  analyzing  here  at  length,  because 
they  cloud  the  essential  ideas  of  the  analysis,  are  quite  program-dependent,  and  do  not  change 
the  essential  character  of  the  dependence  of  mean-squared  output  noise  on  N. 

We  now  wish  to  obtain  a  formula  for  output  noise-to-signal  ratio  in  an  FFT,  by  considering 
the  overflow  constraint  in  conjunction  with  our  noise  analysis.  We  can  insure  against  overflow 
in  the  FFT  computation  by  keeping  the  input  x(n)  sufficiently  small  so  that  no  output  X(k)  exceeds 
unity  in  magnitude.  The  maximum  modulus  of  the  complex  numbers  in  the  array  is  nondecreas¬ 
ing  as  we  proceed  from  stage  to  stage,  since  we  can  deduce  easily  from  ( III- 15)  that 

max  (  lXm(i)l’  I  Xm(^  I  )  <  max  {lxm+l(i)l’ 

<  2  max  {|Xm(i)|,  |Xm(j)|}  .  (HI-19) 

Thus,  the  constraint  |X(k)|  <  1  guarantees  also  that  there  will  be  no  overflow  in  the  computation 
of  an  intermediate  array. 

In  order  to  express  this  constraint  as  a  bound  on  the  input  sequence,  we  observe  (from  the 
definition  of  the  DFT)  that  the  maximum  possible  output  can  be  expressed  in  terms  of  the  max¬ 
imum  input  as 

N-l 

lX(k)lmax=  Nn)lmax  Z  I Wnk |  -  N|x(n)|max  .  (III-20) 

n=0 

Thus,  bounding  the  input  sequence  so  that 

|  x(n)  |  <  1/N  (III-21) 

will  prevent  overflow.  To  obtain  an  explicit  expression  for  output  signal  variance,  we  assume 
x(n)  white,  with  real  and  imaginary  parts  each  uniformly  distributed  in  (  —  N,  \/NrZ  N). 

Then,  we  have 

<?£  =  6|x(k)|2  =  N<tx2  £  N6|x(n)|2  =  .  ( III— 22 ) 


By  combining  this  with  (111-18), 


3N  24 


(111-23) 


The  assumption  of  white  input  signal  is  not  critical  here.  For  example,  if  a  complex  sinusoid 

2  2  2 

x(n)  =  (l/N)  exp  [ i ( 2 7r k  /N  +  (p) ]  had  been  selected,  cr_/crY  would  still  be  proportional  to  N  ,  which 

O  H/  A 

is  the  essential  point  of  (III- 23). 

We  can  be  slightly  more  clever  about  the  way  we  prevent  overflow  in  the  FFT,  and  obtain  a 

2 

noise-to-signal  ratio  proportional  to  N,  instead  of  to  N  .  We  observe  [see  (111-19)]  that  the  max¬ 
imum  modulus  of  the  numbers  in  an  array  increases  gradually,  never  by  more  than  a  factor  of 
two,  as  we  proceed  from  stage  to  stage.  Thus,  we  can  prevent  overflow  by  incorporating  an 
attenuation  factor  of  1/2  at  each  stage,  and  requiring  for  the  input  array  that  |x(n)|  <  1.  The 
attenuation  may  be  carried  out  previous  to  the  computation  of  each  succeeding  stage,  or  may  be 
incorporated  into  the  butterfly  computation.  This  step-by-step  scaling  tends  to  keep  the  signal 
level  as  high  as  possible  (compared  with  the  roundoff  errors)  as  we  proceed  from  stage  to  stage, 
and  thus  might  be  expected  to  yield  lower  noise-to-signal  ratio  than  when  we  simply  scale  the 


45 


input  arrav  so  that  lx(n)|  <  l/N.  This  is  indeed  the  case.  Using  step-by-step  scaling,  the  at¬ 
tainable  output  signal  level  (for  white  Input)  is  the  same  as  in  (III-22),  sinee  the  output  signal 
level  does  not  depend  on  where  the  scaling  is  done,  but  only  on  how  much  overall  scaling  is  done. 
However,  with  step-by-step  scaling,  the  output  noise  level  will  be  less  than  in  ( III— 18),  sinee 
the  noise  introduced  at  early  stages  of  the  FFT  will  be  attenuated  by  the  scaling  which  takes 
plaee  at  the  later  array.  To  consider  this  quantitatively,  let  us  refer  baek  to  Fig.  11.  The  tree 
still  enables  us  to  determine  those  noise  sources  which  affeet  a  given  output  point,  but  now  atten¬ 
uation  factors  of  l/2  must  be  included  on  all  the  arrows  of  the  flow  graph  in  order  to  determine 
the  output  noise  variance  resulting  from  any  noise  source.  For  example,  a  noise  souree  in  the 
seeond-from-last  array  will  go  through  an  attenuation  of  1/4  in  propagating  to  the  output,  result- 

■y 

ing  in  an  output  noise  variance  which  is  (l/4P  l/l6  times  the  variance  of  the  noise  source. 

For  the  N  -  8  case  depicted,  the  output  noise  variance  is 


„  2  _  , .  .  1  .2  ,  1  .2,  2 

[1  +  2(  ^  )  4  4(  ^  )  ]  cr-p, 

2 

where  <r„,  represents  the  noise  souree  variance.  We  notice  that  although  earlier  arrays  eontrib- 

2 

ute  a  larger  number  of  noise  sources  to  a  given  output  point,  their  actual  contribution  to  is 
less  than  that  of  later  arrays,  because  of  the  scale  factors. 

For  the  general  case  N  =  2U,  we  can  deduce  that 


2  r, 

=  t1 


+  2(  j)Z  +  4(  j)2  + 


—  (  — )Z] 
2  '  N  1  J 


or  approximately  (for  large  N) 


cr 


2 

E 


2  a 


13’ 


(111-24) 


Thus,  for  reasonably  large  N,  the  output  noise  variance  does  not  increase  with  N,  and  is  much 

2  2 

less  than  the  variance  in  (111-18).  Generally,  cr^,  will  be  slightly  higher  than  cr^,  since  it  must 
include  the  noise  due  to  the  scaling  as  well  as  the  noise  due  to  the  multiplication  by  W. 

We  assume  that  the  scaling  is  accomplished  by  rounded  right  shift  where,  if  the  bit  to  be 
shifted  off  is  1,  a  random  choiee  is  made  as  to  whether  to  round  the  result  up  or  down.  The 
rounding  error  is  0  with  probability  1/2,  -(l/2)2-t  with  probability  1/4,  and  (l/2)2_t  with  prob¬ 
ability  1/4.  Its  variance  is 


cr 


2 

s 


(III-2S) 


Since  shifts  of  both  real  and  imaginary  parts  are  needed,  we  have 


cr 


2 

B’ 


=  4 


+  4 


=  (5/6)2'Zt 


(III  —  26) 


■p  — 

slightly  greater  than  cr^  =  (4/12)2 

Now,  we  can  eombine  (111-24)  with  (III-22)  to  obtain  the  output  noise-to-signal  ratio  for  the 
case  of  step-by-step  sealing.  We  obtain 

CT  2 

— §■  =  6NctZ,  =  (5N)2_Zt  (111-27) 

ffX 

2 

a  result  proportional  to  N,  rather  than  to  N  .  An  interpretation  of  (111-27)  is  that  the  rms  output 
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noise-to-signal  ratio  increases  as  \Tn,  or  by  half  a  bit  per  stage.  This  result  has  been  stated 
27 

by  Welch.  It  is  important  to  note  that  the  assumption  of  white  signal  is  not  essential  in  the  anal¬ 
ysis.  The  basic  result  of  half-a-bit-per-stage  increase  holds  for  a  broad  class  of  signals,  with 
only  the  constant  multiplier  in  (III-27)  being  signal-dependent.  In  general,  with  scaling  at  each 
array,  the  output  variance  is  related  to  the  variance  of  the  input  array  by 

=  W  ffx  =  6!x(n)|2  (HI— 28) 

so  that 


ffE  (5/3)N2_Zt 

2  2 

a  „  <j 

X  x 


(I  II -29 ) 


2 

where,  to  reduce  noise-to-signal  ratio,  we  would  like  to  make  as  large  as  possible  but  are 
limited  by  the  constraint  |x(n)|  <  1. 

2  2 

We  should  also  note  that  the  dominant  factor  causing  the  increase  of  tfj^/o''/  with  N  is  the 
decrease  in  signal  level  (required  by  the  overflow  constraint)  as  we  pass  from  stage  to  stage. 
According  to  ( III -24)  and  (III-26),  very  little  noise  (only  a  bit  or  two)  is  present  in  the  final  array. 
Most  of  the  noise  has  been  shifted  off  by  the  scalings.  However,  the  mean-squared  signal  level 
has  decreased  by  a  factor  of  l/N  from  its  initial  value,  due  to  the  scalings.  Our  output  consists 
not  of  the  DFT  defined  by  ( III -2 ) ,  but  of  l/N  times  this  DFT. 

We  have  assumed  straight  fixed  point  computation  in  this  section,  i.e.,  only  preset  attenua¬ 
tions  were  allowed,  and  we  were  not  permitted  to  rescale  on  the  basis  of  an  overflow  test. 

Clearly,  if  our  hardware  or  programming  facility  are  such  that  straight  fixed  point  must  be  used, 
we  should  if  possible  incorporate  attenuators  of  1/2  at  each  array,  rather  than  using  a  large  at¬ 
tenuation  of  the  input  array. 

In  Sec.  C  below,  we  shall  consider  the  effect  of  roundoff  noise  on  a  block  floating  point  FFT, 
where  the  computations  are  continually  checked  for  overflow,  and  a  scaling  of  the  array  is  car¬ 
ried  out  only  when  an  overflow  occurs.  For  some  inputs,  scaling  will  not  be  necessary  at  all 
stages  of  the  FFT,  so  that  the  signal  level  and  signal -to-noise  ratio  can  be  kept  higher  than  in 
fixed  point.  Our  fixed  point  analysis  just  presented  can  be  considered  as  a  worst -case  analysis 
for  block  floating  point. 

The  predicted  output  noise-to-signal  ratio,  as  given  in  <111-29 ),  has  been  checked  exper¬ 
imentally.  Test  sequences  were  transformed  twice  —  once  using  36-bit  arithmetic,  and  once 
using  rounded  arithmetic  with  much  shorter  word  length  (e.g.,  12  bits).  The  results  were  sub¬ 
tracted,  squared,  and  averaged  to  obtain  an  estimate  of  noise  variance.  Both  white  and  sinusoi¬ 
dal  inputs  were  used  as  test  signals.  Figure  12  displays  the  results,  plotting  on  a  log-log  scale 

2  -2t  2  *1/2 

normalized  rms  output  noise-to-signal  ratio  (a^/2  a^)1'  vs  N.  A  theoretical  curve  (straight 

line)  corresponding  to  ( II 1-2 9)  is  shown.  Since  the  white  inputs  used  had  real  and  imaginary  parts 
uniformly  distributed  in  (—1,  1),'  with  <7^  =  2/3,  the  theoretical  curve  used  was 


=  (5/2)1/2N1/2 


(III  —  30 ) 


t  Actually,  a  distribution  in  (—  \/s/2,  l/\/2)  should  have  been  used,  since  the  distribution  which  was  used  does 
not  guarantee  |x(n)|  <  1,  making  it  possible  (though  not  probable)  that  an  overflow  can  occur  despite  shifts  at 
every  array.  However,  such  an  overflow  did  not  occur  in  any  of  our  experimental  runs. 
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Fig.  12.  Theoretical  and  experimental  noise-ta-signal  ratios  for  fixed  paint  FFT, 
with  rounded  arithmetic. 


N 

Fig.  13.  Experimental  noise-ta-signal  ratios  far  fixed  paint  FFT,  with  truncation 
instead  af  raunding. 
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Experimental  results  for  white  noise  inputs  are  plotted,  where  for  each  N  the  results  for  sev¬ 
eral  white  inputs  were  averaged  to  yield  a  stable  output  noise  variance  measurement.  We  also 
show  experimental  results  for  sinusoidal  inputs,  of  the  form  x(n)  =  exp  [j(27rqn/N)  +  ip],  For  the 
sinusoidal  inputs,  =  1,  so  that  the  correct  theoretical  curve  is 


=  ( 5/3)1'/2N1//2 


The  actual  experimental  results  were  multiplied  by  [(5/2)/(  5/3)]^  -  (3/2)1  ^ ,  so  that  they 

could  be  conveniently  plotted  alongside  the  already  drawn  theoretical  curve  of  (III-30).  For  each 
N,  the  plotted  results  represent  an  average  over  several  input  frequencies. 

For  both  white  and  sinusoidal  inputs,  the  agreement  of  theory  with  experiment  is  excellent, 
and  certainly  justifies  the  "half-bit-per-stage"  rule  for  the  increase  of  noise-to-signal  ratio. 

The  fact  that  the  experimental  results  lie  consistently  below  the  theoretical  curve,  more  so  for 
lower  N,  can  be  partially  explained  by  the  fact  that  multiplications  by  1  and  j  introduce  less 
roundoff  noise  than  assumed  in  our  model. 

An  issue  of  some  practical  importance  is  the  question  of  how  our  results  would  change  if 
truncation  rather  than  rounding  were  used.  This  question  has  been  investigated  experimentally 
and  the  results  are  shown  in  Fig.  13.  Although  the  actual  values  of  noise-to-signal  ratio  are 
higher  than  for  rounding,  the  slope  of  the  curve  remains  essentially  the  same.  This  might  be 
expected,  since  the  half-bit-per-stage  slope  is  determined  chiefly  by  the  signal  attenuation  (same 
for  rounding  and  truncation)  and  not  by  the  noise  accumulation.  The  somewhat  greater  output 
noise  for  truncation  is  to  be  expected,  due  to  the  fact  that  a  truncation  noise  source  has  a  slightly 
higher  mean-square  than  a  rounding  noise  source,  and  perhaps  also  due  to  the  correlation  of 
truncation  noise  with  signal  sign. 

When  rounded  arithmetic  was  used,  except  that  the  results  of  right  shifts  scaling  the  array 
were  consistently  rounded  up  (rather  than  rounded  randomly  up  or  down),  it  was  found  that  the 
noise  was  somewhat  higher  than  with  randomized  rounding,  but  lower  than  when  truncation  was 
used  throughout  the  FFT. 


C.  ROUNDOFF  NOISE  -  BLOCK  FLOATING  POINT 

A  block  floating  point  FFT  can  be  carried  out  in  the  following  manner:  the  original  array 
is  normalized  to  the  far  left  of  the  computer  word,  with  the  restriction  that  |x(n)|  <  1;  the  com¬ 
putation  proceeds  in  a  fixed  point  manner,  except  that  after  every  addition  there  is  an  overflow 
test;  if  overflow  is  detected,  the  entire  array  is  shifted  right  one  bit  and  the  computation  con¬ 
tinues.  The  number  of  necessary  shifts  are  counted  to  determine  a  scale  factor  or  exponent 
for  the  entire  final  array.  The  output  noise-to-signal  ratio  depends  strongly  on  how  many  over¬ 
flows  occur,  and  at  what  stages  of  the  FFT  they  occur.  The  positions  and  timing  of  overflows 
are  determined  by  the  signal  being  transformed,  and  thus,  in  order  to  analyze  noise-to-signal 
ratio  in  block  floating  FFT,  one  needs  to  know  the  signal  statistics.  This  is  in  contrast  to  the 
fixed  point  analysis  above,  where  it  was  not  necessary  to  assume  specific  signal  statistics. 

The  necessary  number  of  right  shifts  of  the  array  is  related  to  the  peakiness  of  the  DFT  of 
the  signal  being  transformed.  If  the  constant  signal  x(n)  =  1  or  the  single-frequency  input 
x(n)  =  exp  [j(27r/N)  kQn]  is  transformed,  the  output  (with  kQ  an  integer)  will  consist  of  a  single 
nonzero  point,  and  (for  N  =  2U)  v  scalings  of  the  array  will  be  necessary,  one  for  each  stage. 
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For  these  maximally  peaked  spectra,  block  floating  point  computation  reduces  to  the  fixed  point 
computation  described  above.  Generally,  the  more  scalings  of  the  array  that  are  needed,  the 
worse  will  be  the  output  noise-to-signal  ratio,  so  that  the  fixed  point  analysis  of  the  preceding 
section  (where  scalings  at  every  stage  were  included)  can  be  considered  to  be  a  worst-case  anal¬ 
ysis  for  block  floating  point.  In  choosing  a  register  length  for  a  system  (e.g.,  an  FFT  filter) 
using  block  floating  point  FFT,  one  would  probably  be  wise  to  choose  on  the  basis  of  a  worst- 
case  analysis.  However,  an  overflow  test  and  adaptive  scaling  procedure  is  usually  simple  to 
include  in  a  computer  program,  and  the  valid  question  remains  as  to  how  much  better  than  the 
worst  case  (fixed  point)  one  might  expect  to  do  using  a  block  floating  point  strategy.  Here,  we 
analyze  block  floating  point  FFT  noise-to-signal  ratio  for  the  case  of  a  white  input  signal.  The 
DFT  of  a  white  signal  is  also  white,  and  one  might  expect  (since  the  spectral  energy  is  spread, 
rather  than  peaked  at  a  single  point)  that  scalings  at  all  the  stages  would  not  be  necessary,  and 
a  noise-to-signal  ratio  advantage  over  fixed  point  would  be  gained. 

A  general  scheme  for  analyzing  this  problem  is  as  follows.  First,  we  need  to  determine  the 
probabilities  of  the  various  possible  scaling  patterns  for  a  transform  of  a  white  signal.  We  refer 

Q 

to  a  scaling  pattern  by  a  notation  such  as  SSSNSNSNS,  meaning  that  scales  (in  a  2  -point  trans¬ 
form)  take  place  at  stages  1,  2,  3,  5,  7,  and  9,  with  no  scalings  at  the  remaining  stages.  We 
might  determine  the  probability,  or  frequency  of  occurrence,  of  given  scaling  patterns  either 
theoretically  or  via  an  experimental  histogram  obtained  by  transforming  many  white  sequences, 
and  recording  where  the  scales  occur.  For  any  particular  scaling  pattern,  we  could  estimate 
the  output  noise-to-signal  ratio  by  straightforward  application  of  the  techniques  of  the  preceding 
section.  The  overall  expected  output  noise-to-signal  ratio  for  block  floating  point  would  be  de¬ 
termined  as  an  average  of  the  noise-to-signal  ratios  for  the  various  scaling  patterns,  weighted 
by  the  probability  of  these  patterns. 

Let  us  briefly  illustrate  a  theoretical  approach  to  finding  probabilities  of  scaling  patterns. 
Consider  a  white  input  sequence  x(n),  whose  real  and  imaginary  parts  are  each  uniformly  dis¬ 
tributed  in  (  —  1,  1).  We  wish  to  find  the  probability  of  an  overflow  in  the  first  array  of  computa¬ 
tions.  Since,  in  computing  the  first  array  (see  Fig.  10),  only  additions  and  subtractions  are 
involved,  each  random  variable  (real  or  imaginary)  obtained  is  triangularly  distributed  in  (  —  2,  2). 
An  overflow  will  occur  unless  all  these  2N  variables  lie  inside  (—1,  1).  Thus,  the  probability  of 
overflow  is 

P(overflow)  =  1  —  (3/4)^ 

since  3/4  is  the  probability  that  any  one  variable  lies  inside  (—1,  1).  Note  that  for  any  reason¬ 
ably  large  value  of  N,  P(overflow)  for  the  first  stage  is  very  close  to  1.  In  trying  to  pursue 

9 

this  kind  of  analysis  through  all  the  stages  of,  say,  a  2  -point  FFT,  the  work  quickly  becomes 
quite  cumbersome.  In  later  stages,  the  distribution  of  the  variables  in  the  array  is  not  so  easily 
determined.  Even  if  we  assume  that  these  variables  become  Gaussian,  the  task  is  still  quite 
difficult  due  to  the  large  number  of  possible  scaling  patterns  which  must  be  considered;  for  a 
u-stage  transform,  we  must  consider  Zv  possible  scaling  patterns/ 

t  Here,  we  ore  assuming  that  o  maximum  af  ane  shift  is  necessary  at  any  given  stage.  This  ossumptian  would  be 
true  if  scaling  were  controlled  by  the  modulus  af  the  complex  numbers  [see  (111-19)].  But,  normally,  we  test  far 
overflow  in  the  computotian  af  real  and  imaginary  parts,  and  in  this  cose  it  is  possible  (though  quite  unlikely) 
ta  get  up  ta  two  overflows  in  a  stage,  or  a  single  overflow  af  up  ta  2  bits.  We  will  ignore  this  possibility  be¬ 
cause  it  has  not  been  observed  in  ony  af  aur  experimental  runs,  ond  unduly  complicates  the  analysis. 
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Because  of  these  difficulties,  we  have  resorted  to  experimental  histograms  to  estimate  the 
probabilities  of  the  scaling  patterns.  Fortunately,  it  turns  out  that  most  of  the  2V  possible  scal¬ 
ing  patterns  are  highly  improbable  and  can,  for  practical  purposes,  be  neglected.  For  example, 

Q 

in  FFT's  of  500  different  2  -point  white  sequences  [Re  and  lm  parts  uniform  in  (-  1.  1)],  one  of 
eight  specific  scaling  patterns  occurred  in  499  of  the  cases.  These  patterns,  together  with  their 
empirical  probabilities,  are 


SSSNSNSNS 

P  = 

0.24 

SSSNSNSSN 

P  = 

0.16 

SSSNSSNNS 

P  = 

0.15 

SSSNSSNSN 

P  = 

0.10 

SSSSNNSNS 

P  = 

0.13 

SSSSNNSSN 

P  - 

0.08 

SSSSNSNNS 

P  = 

0.08 

SSSSNSNSN 

P  = 

0.05 

Note  that  in  each  case  a  total  of  6  scalings  are  necessary,  3  fewer  than  for  a  fixed  point  com¬ 
putation.  Thus,  we  would  expect  a  noise-to-signal  ratio  improvement  over  that  in  the  fixed  point 
case. 

We  can  show  theoretically  that,  for  these  512-point  transforms,  the  probability  of  exactly 
6  scales  is  very  close  to  unity.  Our  initial  array  x(n)  consists  of  1024  (considering  real  and 

imaginary  parts)  uniformly  distributed,  independent  random  variables  with  variance  a  2  1/3. 

r 

If  no  scalings  were  incorporated,  the  I)FT  X(k)  would  consist  of  1024  approximately  Gaussian 

(by  the  central-limit  theorem),  independent  random  variables  with  variance  a  2  =  512/3.  Let 

'  r 

|  X’r  I  max  denote  the  magnitude  of  that  variable  (real  or  imaginary),  in  the  DFT  array,  which  has 
the  largest  magnitude.  The  probability  that  exactly  6  scalings  will  be  needed  is  given  by 

P(exactly  6  scalings)  =  P(25  <  |X  j  <  2^) 
j  6  r 1  max 


We  have 


X  < 

1  r  1  max 


26!  =  P[a"  |Xr|  <  26,  =  [l 
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Evaluating  this  numerically,  we  obtain 


Similarly, 


P[  X  <  2°]  ~  0.99901398 

1  1  r  max  1 


P[  X  <  23]  =  |1 
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and  thus 


P[25  < 


X. 


<  2 


=  P[  |X  I  <  26]  -  P[  |X  I  <  25] 
1  !  r  max  '  1  1  r 1  max  ‘ 


-  P[exactly  6  scalings]  ~  0.9990136 

Thus,  we  have  verified  theoretically  that  the  probability  of  exactly  6  scalings  is  very  close  to 
unity.  This  technique  can  be  used  to  estimate  the  total  number  of  necessary  scalings,  for  other 
values  of  N. 

q 

We  can  use  empirical  scaling  pattern  probabilities  {such  as  those  given  above  for  N  =  2  )  to 
predict  average  output  noise-to-signal  ratio.  For  any  given  scaling  pattern,  the  output  noise  var¬ 
iance  can  be  derived  using  the  same  techniques  as  for  fixed  point,  with  slight  modifications.  In 
computing  the  output  noise  variance  at  a  particular  output  point  due  to  a  particular  noise  source 
which  is  connected  to  that  output  point,  we  multiply  the  noise  source  variance  by  (1/2)  ,  where 

S  is  the  total  number  of  scalings  following  the  introduction  of  the  noise  source.  The  noise  source 
variance  itself  will  be  ffg  or  CTg,  (see  preceding  section),  depending  on  whether  or  not  the  noise 
due  to  a  scaling  needs  to  be  included.  For  any  scaling  pattern,  the  output  signal  variance  is  re¬ 
lated  to  the  input  signal  variance  by 
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where  S^Qtaj  is  the  total  number  of  scalings.  Once  we  have  determined  ^g/0^  for  all  the 

reasonably  probable  scaling  patterns,  we  determine  an  overall  "theoretical"  c r^/ov  as  an  aver- 

2  2  A. 

age  over  the  various  patterns  of  for  a  pattern,  weighted  by  the  probability  of  occurrence 

of  this  scaling  pattern. 

For  values  of  N  from  32  to  2048,  experimental  probabilities  of  scaling  patterns  have  been 
obtained  (for  white  inputs).  By  using  these  patterns  and  the  procedure  just  described,  "theoret¬ 
ical"  output  noise-to-signal  ratios  have  been  derived.  The  computations  involved  in  deriving 
these  results  are  straightforward  but  lengthy  and  are  omitted.  In  the  derivations,  account  was 
taken  of  the  fact  that  multiplications  by  W  =  1  and  j  result  in  lower  roundoff  noise  variance  than 
for  other  values  of  W  (although  neglecting  this  fact  changes  the  results  very  little).  These  the¬ 
oretical  results  are  represented  as  a  curve  (line  segments  joining  the  computed  points)  on 
Fig.  14.  Also  presented  are  experimentally  measured  values  of  output  noise-to-signal  ratio  for 
block  floating  point  FFT's  of  white  input,  using  rounded  arithmetic  (which  was  assumed  in  the 
analysis).  The  experiments  and  theory  agree  closely.  For  comparison,  a  theoretical  curve  rep¬ 
resenting  fixed  point  noise-to-signal  ratio  (for  rounded  arithmetic)  is  also  shown.  We  see  that 
for  this  kind  of  input  (white),  block  floating  point  provides  some  advantages  over  fixed  point,  espe¬ 
cially  for  the  larger  transforms.  For  N  =  2048,  the  rms  noise-to-signal  ratio  for  block  floating 
point  is  about  1/8  that  of  fixed  point,  representing  a  3-bit  improvement. 

An  experimental  investigation  was  used  to  examine  how  the  results  for  block  floating  point 
change,  when  truncation  rather  than  rounding  is  used.  The  results  of  this  experiment  are  also 
shown  on  Fig.  14.  Noise-to-signal  ratios  are  generally  a  bit  or  two  worse  than  for  rounding. 

The  rate  of  increase  of  noise-to-signal  ratio  with  N  seems  to  be  about  the  same  as  for  rounding. 
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Fig.  14.  Experimental  and  theoretical  noise-to-signal  ratios  for  block  floating  point  FFT. 


D.  ROUNDOFF  NOISE  -  FLOATING  POINT 

In  this  section,  a  statistical  model  for  roundoff  errors  (as  specified  in  Sec.  I-C-2)  is  used 
to  predict  output  noise-to-signal  ratio  when  an  FFT  is  computed  using  floating  point  arithmetic. 
The  result,  derived  for  the  case  of  white  input  signal,  is  that  the  ratio  of  output  noise  variance 

to  output  signal  variance  is  proportional  to  v=  log-  N,  where  N  is  the  number  of  points  trans- 

^  28 

formed.  This  predicted  result  is  significantly  lower  than  bounds  previously  derived  on  mean- 

2 

squared  output  noise-to-signal  ratio,  which  are  proportional  to  v  .  That  the  statistical  result 
increases  more  slowly  with  v  than  the  bound  is  certainly  not  surprising,  since  in  a  statistical 
analysis  we  assume  that  all  noise  sources  in  the  system  add  as  independent  random  variables, 
whereas,  to  obtain  a  deterministic  bound,  we  must  assume  complete  correlation  between  the 
noise  sources. 

The  model  applies  to  rounded  arithmetic,  and  experimental  noise  measurements  closely 
corroborate  the  theory  when  rounded  arithmetic  is  used  in  the  experiments.  Even  for  nonwhite 
(e.g.,  sinusoidal)  signals,  experimental  noise-to-signal  ratios  are  observed  to  follow  a  linear 

dependence  on  v.  These  experimental  results  seem  to  disagree  with  those  obtained  by  Gentleman 

28  2 
and  Sande,  which  indicated  that  the  noise-to-signal  ratio  increased  as  v  rather  than  as  v. 

Most  of  their  experimental  results  pertained  to  truncation  rather  than  rounding,  and  we  have 

also  verified  that,  for  truncation,  a  quadratic  dependence  on  o  fits  the  experimental  data 

more  closely  than  a  linear  dependence.  In  some  of  their  experiments,  however,  rounding  was 
2 

used  and  a  v  dependence  still  was  observed.  But  our  rounding  procedure  included  a  random 
decision  as  to  whether  to  round  up  or  down,  whe'n  a  computed  result  lay  midway  between  two 
machine  numbers.  We  found  that  when  this  random  rule  was  not  used,  but  was  replaced  by  an 
upward  round  of  the  mantissa  in  this  midway  situation,  the  experimental  noise-to-signal  ratios 
increased  significantly,  and  could  indeed  be  better  fitted  with  a  quadratic  than  with  a  linear  curve. 
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This  interesting  observation  is  a  possible  explanation  for  the  apparent  discrepancy  between  our 
experimental  results  and  those  of  Gentleman  and  Sande.  The  midway  situation  occurs  quite  fre¬ 
quently  in  floating  point  computation,  especially  for  addition  of  numbers  of  the  same  order  of 
magnitude,  where  the  unrounded  mantissas  are  very  often  only  1  bit  longer  than  the  rounded 
mantissas.  Always  rounding  the  mantissa  up  in  this  situation  introduces  correlation  between 
roundoff  error  and  signal  sign,  which  contradicts  the  assumption  that  roundoff  errors  are  signal- 
independent.  This  correlation  seems  to  be  enough  to  cause  the  output  noise-to-signal  ratio  to 
increase  significantly  faster  than  u. 

1.  Statistical  Analysis 

Our  basic  result  for  output  noise-to-signal  ratio  in  a  floating  point  FFT  of  a  white  input  sig¬ 
nal  is 


v 


(111-31) 


2 

where  a  is  the  variance  of  a  floating  point  roundoff  variable  (see  Sec.  I-C-2).  In  the  next  few 
paragraphs,  we  give  a  rather  heuristic  derivation  of  this  result. 

I.et  us  refer  back  to  the  FFT  flow  graph  of  Fig.  11.  In  the  floating  point  case,  as  in  the 
fixed  point  case,  we  can  model  the  roundoff  errors  by  introducing  noise  generators  at  all  the 
nodes  of  the  flow  graph.  Postulating  that  all  roundoffs  are  independent  of  each  other  and  of  the 
signal,  we  assume  that  all  the  noise  sources  injected  at  the  various  nodes  are  mutually  uncor¬ 
related,  and  uncorrelated  with  the  signal.  Then,  given  the  variances  of  the  noise  sources,  our 
procedure  for  determining  output  noise  variance  is  identical  to  the  fixed  point  case.  We  draw 
a  tree,  such  as  that  shown  in  Fig.  11,  to  indicate  all  the  noise  sources  which  affect  a  given  out¬ 
put  point.  Then,  since  the  transmissions  on  the  arrows  of  the  FFT  flow  graph  all  have  unity 
magnitude,  the  resultant  noise  variance  at  the  output  point  of  interest  is  simply  the  sum  of  the 
variances  of  the  noise  sources  injected  at  the  nodes  of  the  tree. 

For  the  fixed  point  case,  computing  this  resultant  variance  involved  simply  counting  the 
nodes  of  the  tree,  since  all  noise  sources  had  equal  variance.  But  for  the  floating  point  case, 
all  noise  sources  do  not  have  equal  variance,  since  the  variance  of  a  roundoff  noise  source  is 
proportional  to  the  variance  of  the  signal  being  rounded,  and  the  signal  variance  changes  as  we 
proceed  from  array  to  array  in  the  FFT  computation.  This  change  is  easily  analyzed  for  the 
case  of  a  white  signal,  where  in  passing  from  one  stage  to  the  next  the  signal  remains  white, 
and  its  variance  simply  doubles  [this  can  be  verified  using  (III -4  5)].  Thus,  for  white  signals, 
the  roundoff  noise  source  variances  are  equal  for  a  given  array,  but  double  as  we  pass  from 
array  to  array. 

The  noise  source  variances  double  as  we  pass  from  stage  to  stage,  but,  as  can  be  seen  from 
the  tree  of  Fig.  11,  the  number  of  noise  sources  affecting  a  given  output  point  halves  as  we  pro¬ 
ceed  forward  from  stage  to  stage.  These  effects  exactly  offset  one  another,  so  that  the  total 
contribution  to  output  noise  variance  from  the  noise  sources  at  any  particular  array  does  not 
vary  with  the  position  of  the  array.  Since  we  have  a  total  of  i>  arrays,  the  output  noise  variance 
is  proportional  to  v  as  indicated  in  ( III  —  31). 

To  obtain  the  explicit  result  stated  in  (III-31),  we  need  to  find  the  noise  variance  at  an  out¬ 
put  point,  due  to  the  noise  sources  at  any  particular  array.  Let  us  consider  the  first  array  (the 
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2  '  2 

array  following  the  input  array).  If  we  assume  a  variance  cr^  £  £>|x(n)|  for  the  input,  then  we 


can  show  via  our  floating  point  roundoff  model  that  the  variance  of  the  first  array  of  noise  sources 
is 


(III- 32) 


v- 1 

From  the  tree,  we  find  that  N/2,  or  2  of  these  noise  sources  affect  any  given  output  point, 
so  that  the  resulting  output  noise  variance  is 


( III  —  3  3 ) 


1 


Since  there  are  v  arrays,  all  contributing  equally  to  the  output  noise  variance,  the  total  output 
noise  variance  is 


2  0  2  ,0u  2. 

cte  =  2 u(2  ax  ) 


By  noting  that  the  output  signal  variance  is  related  to  the  input  signal  variance  by 


the  result  follows: 


[(III  —  34)] 


2 

X 


A  further  result,  which  can  be  derived  from  our  model,  is  an  expression  for  the  final  ex¬ 
pected  output  noise-to-signal  ratio  which  results  after  performing  an  FIT  and  an  inverse  FFT 
on  a  white  signal  x(n).  The  inverse  FFT  introduces  just  as  much  roundoff  noise  as  the  FFT  it¬ 
self,  and  thus  we  can  convince  ourselves  that  the  resulting  output  noise-to-signal  ratio  is 


(III—  34a ) 


or  just  double  the  result  in  (III -31). 

In  order  to  see  the  implications  of  (II I  -  3 1 )  or  (III-34a)  in  terms  of  register  length  require¬ 
ments,  it  is  useful  to  express  these  results  in  units  of  bits.  Corresponding  to  (III- 31),  we  use 


(III  —  34b) 


to  represent  the  number  of  bits  by  which  the  rms  noise-to-signal  ratio  increases  in  passing 
through  a  floating  point  FFT.  For  example,  for  u  =  8  this  represents  2  bits,  and  for  v  =  II  it 
represents  2.23  bits.  The  number  of  bits  of  rms  noise-to-signal  ratio  increases  as  log^  (log2N), 
so  that  doubling  the  number  of  points  in  the  FFT  produces  a  very  mild  increase  in  output  noise, 
significantly  less  than  the  half-bit-per-stage  increase  for  fixed  point  computation.  In  fact,  to 
obtain  a  half-bit  increase  in  the  result  above,  we  would  have  to  double  v,  or  square  N. 

A  detail  in  our  analysis  which  should  probably  be  spelled  out  more  carefully  is  the  deriva¬ 
tion  of  the  expression  (III— 32 )  for  the  variance  of  the  noise  source  introduced  in  a  butterfly  com¬ 
putation.  In  obtaining  this  result,  we  begin  by  expressing  the  butterfly  computation  (III - 1  5a )  in 
terms  of  real  arithmetic  [a  similar  analysis  could  be  carried'out  for  ( 1 1 1  —  1  5b )  ] 
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( III- 3  5a) 


Re  (Xm  +  1(i)]  =  Re  [Xm(i)]  +  Re  W  Re  [Xm(j)J  -  Im  W  Im  [Xm(j)] 

lm  [Xm+1(i)]  -  Im  [Xm(i)J  +  Re  W  Im  [Xm(j)J  +  Im  W  Re  [Xm(j)J  .  (IH-35b) 

Because  of  roundoffs,  we  actually  compute 

ft  [Re  Xm+1(i)]  =  {Re  Xm(i)  +  [Re  W  Re  Xm(j)  (1  +  e  J 

-  Im  W  Im  Xm(j)  (1  +  c2)]  (1  +  c3)}  (1  +  e 4)  (III-36a) 

ft  [Im  Xm+1(i)]  =  {Im  Xm(i)  +  [Re  W  Im  Xm(j)  (1  +  «g) 

+  Im  W  Re  Xm(j)  (i  +  e6)]  (1  +  f?)}  (1  +  £g)  .  (III-36b) 

An  expression  for  the  noise  U  (i)  introduced  by  roundoff  is  obtained  by  subtracting  ( III  -  3  5 )  from 
( III  -  36 )  and  neglecting  terms  of  second  or  higher  order  in  the  roundoff  variables  £^.  We  obtain 

Um(i)  =  Re  [Xm(i)J  (e4)  +  Re  W  Re  [Xm(j)J  (e  i  +  e  3  +  e 4)  -  lm  W  lm  [Xm(j)] 

X(£2  +  £3  +  <4)  +  j  {Im  [Xm(i)J  ( £ 8 )  +  Re  W  Im  [Xm(j)] 

X(c5  +  £?  +  £g)  +  lm  W  Re  [Xm(j)]  (£6  +  £?  +  £g)}  .  (III-37) 

For  a  white  input  signal.  Re  [Xm(i)],  Re  [Xm(j)],  Im  [Xm(i)J,  andlm  [Xm(j)]  are  mutually  uncor¬ 
related  with  equal  variances,  and  we  can  obtain  the  noise  source  variance  as 

ffu  =  glUm(i)|2  =  4ff£2glXm(i)!2  •  (III_38) 

For  the  computation  of  the  first  array,  6|Xm(i)|2  =  6|XQ(i)|2  =  £|X(n)|^  =  cr^,  and  ( III- 32) 
follows. 

In  this  analysis,  we  have  not  considered  the  fact  that  a  reduced  error  variance  is  introduced 
when  W  =  1  or  j.  In  fact,  for  W  =  1,  since  multiplications  by  1  can  be  performed  noiselessly, 
we  have  £^  =  £2  =  eg  =  £g=c^  =  e7  =  0in  (111-36),  and  the  noise  source  variance  is  half  that  given 
in  ( III  -  38 ) .  A  similar  result  holds  for  W  =  j.  For  a  specified  radix  2  algorithm,  such  as  the 
decimation  in  time  algorithm  shown  in  Fig.  10,  these  reduced  variances  for  W  =  1  and  j  can  be 
included  in  the  model  to  obtain  a  slightly  reduced  prediction  for  output  noise-to-signal  ratio. 
However,  for  reasonably  large  N,  this  modified  noise  analysis  yields  only  slightly  better  pre¬ 
dictions  of  output  noise  than  does  the  simplified  analysis  above. 

A  consequence  of  our  analysis  leading  to  (III-31)  is  that  the  output  noise  is  white.  This  fol¬ 
lows  from  the  fact  that  each  array  of  noise  sources  is  whited  The  reduced  noise  source  variance 
for  W  =  1  and  j  implies  that  for  some  arrays  there  will  be  a  variation  of  noise  source  variance 
over  the  array.  This  implies  a  slight  variation  of  output  noise  variance  over  the  output  array, 
and  thus  our  modified  noise  analysis  will  only  predict  an  average  noise  variance  over  the  output 

t  Actuolly,  os  in  the  fixed  point  cose,  the  pair  of  noise  sources  introduced  in  computing  the  two  outputs  of  o 
given  butterfly  are  correlated.  This  means  that  there  is  a  slight  carrelotian  of  noise  over  the  output  array,  but 
does  not  offect  our  result  for  output  noise  vorionce  since  the  tap  and  bottom  output  nodes  af  o  butterfly  affect  o 
disjoint  set  af  paints  in  the  output  array. 
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array.  For  the  decimation  in  time  algorithm  depicted  in  Fig.  10  (which  was  the  algorithm  used 
in  our  experimental  work),  such  a  modified  analysis  was  carried  out.  For  this  case,  only  W  1 
is  used  in  the  first  array  of  computations,  and  only  W  =  1  and  j  are  used  in  computing  array  2 
from  array.  1.  In  computing  array  3  from  array  2,  half  the  butter  flies  involve  W  =  1  or  j,  in  the 
next  array,  one-quarter  of  the  butterflies  involve  W  =  1  or  j;  etc.  The  result  of  our  analysis, 
which  is  to  be  interpreted  as  an  average  (over  the  output  array)  of  noise  to  signal  variance  is 

2 

-§-  =  2a2  [u  -(3/2)  +  (l/2)u_1]  .  (III-39) 

ffX 

As  v  becomes  moderately  large  (say,  v  >  A),  we  see  that  (III-39)  and  (III-31)  predict  essentially 

2  2 

the  same  linear  rate  of  increase  of  with  v. 

2.  Experimental  Noise  Measurements 

The  essential  result  of  the  above  analysis  is  that  mean-squared  output  noise-to-signal  ratio 

in  a  floating  point  FFT  is  proportional  to  u.  Experimental  testing  of  this  prediction  was  deemed 

particularly  important,  both  because  of  the  many  assumptions  inherent  in  the  model,  and  because 

the  prediction  seemed  to  be  contradicted  by  the  experimental  results  of  Gentleman  and  Sande, 

who  observed  (for  white  noise  input)  that  measured  output  noise-to-signal  ratios  were  approxi- 

2 

imately  proportional  ton  .  (Their  noise  measurements  were  carried  out  by  performing  a  trans¬ 
form  and  inverse,  and  subtracting  the  initial  and  final  sequences.)  Our  first  thought  at  an  ex¬ 
planation  of  the  discrepancy  between. their  results  and  ours  was  that  most  of  their  experiments 

were  carried  out  using  truncation,  whereas  our  model  applies  to  rounded  arithmetic.  However, 

2 

a  few  of  their  experiments  used  rounding,  and  even  there  a  v  dependence  of  mean-squared  out¬ 
put  noise-to-signal  ratio  was  observed. 

We  carried  out  experimental  noise  measurements  as  follows.  To  check  (111-31),  a  white 
sequence  (consisting  of  uniformly  distributed  random  variables)  was  generated  and  transformed 
twice  —  once  using  rounded  arithmetic  with  a  short  (e.g.,  12-bit)  mantissa,  and  once  using  a 
much  longer  (27-bit)  mantissa.  The  results  were  subtracted,  squared,  and  averaged  to  estimate 
the  noise  variance.  For  each  N  =  2  ,  this  process  was  repeated  for  several  white  noise  inputs 
to  obtain  a  stable  estimate  of  roundoff  noise  variance.  To  check  (III- 34a ),  white  sequences  were 
put  through  an  FFT  and  inverse  (using,  for  example,  12-bit  rounded  arithmetic),  and  the  mean- 
squared  difference  between  initial  and  final  sequences  was  taken. 

The  initial  results  we  obtained  seemed  to  be  more  in  agreement  with  Gentleman  and  Sande's 

experiments  than  with  our  theory,  since  the  measured  noise-to-signal  ratios  were  higher  than 

2 

we  predicted  and  could  be  fitted  more  closely  with  a  curve  of  the  form  av  than  with  a  linear 
curve  in  v.  The  explanation  for  this  seemed  to  be  that  our  model  of  independent  noise  generators 
must  be  incorrect,  and  a  significant  correlation  between  signals  and  floating  point  roundoff  er¬ 
rors  must  exist.  However,  the  rounding  procedure  used  in  these  initial  experiments  did  not  in¬ 
clude  a  random  choice  as  to  whether  to  round  up  or  down  when  a  result  lay  midway  between  two 
machine  numbers.  We  merely  looked  at  the  first  extra  bit  of  the  (unsigned)  mantissa,  added 
2  4  to  the  mantissa  if  this  bit  were  1,  and  otherwise  simply  truncated  to  t-bits.  When  we  mod¬ 
ified  this  procedure,  and  rounded  randomly  up  or  down  in  the  midway  situation  (where  the  first 
extra  bit  of  the  mantissa  is  1,  and  all  remaining  extra  bits  are  0),  the  experimental  noise-to- 
signal  ratios  followed  our  predictions  quite  closely.  It  turns  out  that,  in  floating  point  addition 
of  two  numbers  of  the  same  order  of  magnitude,  this  midway  situation  occurs  quite  frequently. 
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Rounding  always  up,  rather  than  randomly  up  or  down,  in  this  situation  introduces  a  correlation 
between  roundoff. error  and  signal  sign. 

Our  results  for  white  input  signals  are  summarized  in  Figs.  15  and  16.  Figure  15  shows  the 
results  for  an  FFT,  and  Fig.  16  displays  our  results  for  an  FFT  and  inverse.  The  figures  show 
straight  lines  corresponding  to  the  simple  theoretical  results  (111-31)  and  (111 - 34a );  also  shown 
are  modified  theoretical  curves  corresponding  to  the  analysis  leading  to  (111-39).  Experimental 
results  corresponding  both  to  our  initial  experiments  (nonrandomized  rounding)  and  to  our  ex¬ 
periments  using  randomized  rounding  are  presented.  The  results  for  randomized  rounding  cor¬ 
respond  very  closely  with  our  modified  noise  analysis,  and  reasonably  closely  with  our  simpler 
analysis.  The  results  for  nonrandomized  rounding  do  not  follow  our  theory  very  well,  and  can 
better  be  fitted  by  the  quadratic  curves  shown.  Actually,  when  calibrated  in  terms  of  bits,  the 
results  for  randomized  and  nonrandomized  rounding  are  not  that  different  since  a  factor  of  4  on 
Figs.  15  or  16  corresponds  to  1-bit  difference  in  rms  noise-to-signal  ratio.  For  the  plotting  of 
theoretical  curves,  the  empirical  value 

a2  =  (0.21)2'2t  (111-40) 

was  used  for  the  variance  of  the  roundoff  variables.  This  is  very  close  to  the  value  (11-48)  which 
was  found  empirically  to  fit  noise  measurements  on  simple  recursive  filters. 

For  the  case  of  nonrandomized  rounding,  experiments  were  performed  to  isolate  the  effects 
of  error  due  to  rounding  of  multiplications  and  additions.  For  example,  to  measure  the  total 


Fig.  15.  Experimentol  and  theoretical  noise-to' 
signol  rotios  for  flooting  point  FFT. 


Fig.  16.  Experimentol  and  theoretical  noise-to- 
signol  rotios  for  flooting  point  ond  inverse  FFT's. 
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noise  due  just  to  additions,  all  multiplications  were  rounded  to  27  bits,  while  additions  were 
rounded  to  12  bits.  We  found  that  the  output  noise-to-signal  ratio  for  this  case  (of  inaccurate 
additions  and  accurate  multiplications)  increased  faster  than  v,  and  was  almost  as  large  as  when 
additions  and  multiplications  were  both  performed  inaccurately.  The  output  noise-to-signal 
ratio,  measured  with  inaccurate  multiplications  and  accurate  additions,  was  much  lower,  and 
increased  essentially  linearly  with  u.  These  results  supported  our  conjecture  that  the  signal- 
noise  correlation  was  introduced  mainly  in  the  rounding  of  additions. 

Noise-to-signal  ratios  were  also  measured  for  the  case  where  truncation,  rather  than  round¬ 
ing,  was  used  in  the  arithmetic.  As  might  have  been  expected,  the  measured  ratios  were  signif¬ 
icantly  higher  than  in  the  case  of  rounding,  and  were  observed  to  increase  more  than  linearly 

with  u.  The  results  are  shown  in  Fig.  17  for  both  a  transform,  and  a  transform  and  inverse. 

2  -2t  2 

For  the  latter  case,  the  experimental  values  of  c„/Z  a  were  halved  for  convenient  plotting. 

2  x 

Quadratic  curves  of  the  form  au  have  been  fitted  to  the  experimental  results.  We  note  that  the 
measured  values  for  transform  and  inverse  were  somewhat  more  than  double  the  results  for  a 
single  transform. 

At  this  point,  an  important  comment  should  be  made  concerning  the  fitted  quadratic  curves 
of  Figs.  15  to  17.  These  curves  represent  merely  an  empirical  fit  to  the  data,  and  do  not  cor¬ 
respond  to  any  theoretical  result  which  we  have  been  able  to  derive.  They  are  drawn  to  indicate 
that  the  form  of  our  experimental  data  does  not  disagree  with  that  of  Gentleman  and  Sande.  How¬ 
ever,  many  other  curves  could  be  fitted  to  the  data,  and  the  fact  that  the  bound  on  output  noise-to- 

2 

signal  ratio  is  proportional  to  u  does  not  imply  that,  if  we  could  model  theoretically  the  correla¬ 
tion  between  signal  and  roundoff  error,  the  predicted  noise-to-signal  ratio  would  be  proportional 


Fig.  )7.  Experimental  naise-ta-signal  ratios 
far  floating  paint  FFT  and  FFT  —  inverse  FFT; 
truncation  used  instead  af  raunding. 
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to  v  .  In  the  bounding  analysis,  complete  correlation  between  noise  sources  is  assumed,  whereas 
(for  nonrandomized  rounding  or  truncation)  we  are  dealing  with  only  slight  correlation.  The  ex¬ 
perimental  results  are  well  below  Gentleman  and  Sande's  deterministic  bounds  on  output  noise- 
to-signal  ratio. 

Our  analysis,  and  all  the  above  experiments,  applied  to  the  case  of  white  signal.  Some  ex¬ 
perimental  investigation  has  been  carried  out  as  to  whether  our  predictions  are  anywhere  near 
valid  when  the  signal  is  nonwhite.  Specifically,  the  noise  introduced  in  computing  an  FFT  was 
measured  for  sinusoidal  signals  of  several  frequencies,  for  v  8,  9,  10,  and  11.  The  results, 
averaged  over  the  input  frequencies  used,  were  within  15  percent  of  those  predicted  by  (III  —  39). 

In  these  experiments,  our  "randomized"  rounding  procedure  was  used. 


3.  FFT  vs  Direct  Fourier  Transform  Accuracy 

An  issue  of  some  interest  is  the  question  of  how  the  accuracy  obtained  in  an  FFT  computa¬ 
tion  compares  with  that  obtained  in  computing  a  DFT  by  "straightforward"  means,  i.e.,  direct 

28 

summing  of  products  as  implied  by  the  definition  (III  —  2 )  of  the  DFT.  Gentleman  and  Sande 
studied  this  problem  for  the  floating  point  case,  obtaining  (for  both  the  direct  and  FFT  methods) 
deterministic  bounds  on  the  ratio  of  the  rms  output  error  to  the  rms  output  signal.  The  bound 
they  obtained  for  the  direct  method  increases  much  faster  with  N  than  the  corresponding  bound 
for  the  FFT. 

A  quantitative  statement  of  their  results  is  as  follows.  Consider  an  N-point  sequence  x(n). 
First,  suppose  we  compute  its  DFT  and  then  its  inverse  DFT  by  direct  summing  of  products, 
obtaining  a  result  x(n)  +  e(n),  where  e(n)  is  the  error  due  to  roundoffs.  Then, 


1  n_1 

N  „ 
n=0 


1 e(n)  I 


1  n'1 

—  T 
N  n 
n=0 


«  (2)(1.06)2_t(2N),//2 


(III— 41) 


'x(n) 


(Similarly,  a  bound  could  be  derived  for  a  single  transform,  with  no  inverse.) 

Now,  suppose  we  compute  the  DFT  and  inverse  by  means  of  a  radix  2  FFT  algorithm,  where 
N  -  2U.  The  bound  on  rms  output  error  F  rms  output  signal  becomes 


<  (2)(1.06)2_t(2)3/2v  (III  —  42 ) 

rms 

For  large  N,  (III -42 )  is  a  much  lower  bound  than  (III -4 1 ) .  It  is  of  interest  to  compare  the  bound 
( II I -42 )  with  our  corresponding  result  (III-34a),  which  was  obtained  via  a  statistical  analysis. 
Stated  in  rms  terms,  our  statistical  result  is 


°  E 

(21(0.21) 


1/2, -t  1/2 
'2  v 


( III-43) 


1/2 

Our  result  is  proportional  to  v  '  rather  than  v,  because  in  the  statistical  analysis  it  was  as¬ 
sumed  that  errors  from  the  different  stages  of  the  FFT  added  as  independent  random  variables. 
In  the  bounding  analysis  leading  to  ( III -42 ),  it  was  necessary  to  assume  that  the  errors  from  dif¬ 
ferent  stages  add  as  completely  correlated  noises,  and  in  the  worst  possible  way. 
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In  deriving  their  bound  ( III— 4 1 )  for  the  direct  Fourier  transform  case.  Gentleman  and  Sande 

s  t, 

assume  that  the  summing  of  products  is  performed  in  a  cumulative  way  —  that  is,  the  (n  +  I) 
product  is  added  to  the  sum  of  the  first  n  products,  and  so  on.  However,  a  slightly  different, 
more  error-free  technique  could  be  used  for  summing  the  products.  Suppose  instead,  we  were 
to  sum  (III-2)  in  a  treelike  fashion;  that  is,  the  N  products  were  summed  in  pairs,  the  N/2  re¬ 
sults  summed  in  pairs,  etc.  Then,  for  a  OFT  and  inverse,  the  bound  corresponding  to  (111-41) 
is 


^2^  <  2(1.06)2"t(u  +  1)  n/2N  .  (III-43) 

rms 

This  bound  is  lower  than  (III-41),  but  still  increases  faster  with  N  than  the  bound  (III- 42 )  for  the 
FFT.  A  qualitative  explanation  for  this  latter  fact  is  as  follows.  In  the  tree-sum  computation, 
all  output  points  are  computed  independently,  so  that  in  deriving  a  bound  we  must  assume  that 
all  errors  in  the  output  array  assume  their  maximum  possible  values.  In  the  FFT  computation, 
the  output  points  are  not  computed  independently.  If  the  effects  of  roundoff  errors  reinforce  each 
other  in  the  worst  possible  way  at  a  particular  output  point,  they  will  tend  to  cancel  at  another 
output  point. 

In  the  above  discussion,  we  have  considered  deterministic  bounds,  rather  than  a  statistical 

analysis  of  the  output  error.  A  statistical  analysis  of  roundoff  errors  incurred  in  implementing 

the  DFT,  using  the  treelike  summation  technique,  predicts  a  linear  dependence  of  the  rms  out- 

1  /2 

put  noise-to-signal  ratio  on  v  '  ,  similar  to  (III-43).  Thus,  using  a  statistical  analysis,  we 
would  conclude  that  the  accuracy  of  a  direct  "tree-summed"  Fourier  transform  is  essentially  the 
same  as  the  accuracy  of  an  FFT. 

We  should  note  that,  compared  with  cumulative  summing,  the  treelike  summation  requires 
some  additional  memory  (to  store  partial  sums),  and  perhaps  more  inconvenient  indexing.  How¬ 
ever,  for  any  reasonably  large  N,  this  issue  is  academic  since,  for  reasons  of  increased  speed, 
we  would  use  the  FFT  (and  neither  cumulative  nor  treelike  summation)  to  perform  the  computation. 


E.  SUMMARY  OF  FFT  QUANTIZATION  EFFECTS 

Let  us  summarize  and  comment  on  our  main  results  concerning  quantization  effects  in  the 
FFT. 

In  Sec.  III-A,  we  studied  the  effect  of  coefficient  quantization.  We  represented  the  error  due 
to  coefficient  quantization  as  the  result  of  multiplication  of  the  input  sequence  by  a  "stray"  ma¬ 
trix  determined  by  the  coefficient  errors.  To  analyze  the  situation  further,  we  needed  to  make 
the  fairly  crude  assumption  that  all  elements  of  the  stray  matrix  were  uncorrelated.  Using  this 
assumption,  we  obtained  the  result  (for  fixed  point,  rounded  coefficients) 

— I]  =  (v/6)2"2t  ( I II — 44 ) 

^X/coefficient  errors 


for  the  mean-squared  output  er ror-to-signal  ratio.  For  floating  point,  rounded  coefficients, 
essentially  the  same  result  (except  for  a  multiplicative  constant)  was  obtained.  The  key  conclu¬ 
sion  is  that  this  error-to-signal  ratio  increases  as  v  =  log^  N,  and  experimental  results  (as  dis¬ 
played  in  Fig.  9)  were  in  reasonable  agreement  with  this  conclusion.  This  result  can  help  us 
estimate  how  many  bits  of  accuracy  are  needed  in  our  coefficient  table,  for  a  given  requirement 
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on  output  error-to-signal  ratio.  In  general,  the  number  of  bits  retained  in  the  coefficients  need 
not  be  equal  to  the  number  of  bits  retained  in  the  rounding  of  multiplications.  For  example,  in 
fixed  point,  the  ratio  of  the  variance  of  output  noise  due  to  roundoffs,  to  output  signal  variance, 
is  proportional  to  N  rather  than  to  log.,  N  as  in  (III-44).  Thus,  for  large  N,  it  would  be  reason¬ 
able  to  retain  fewer  bits  in  the  coefficients  than  in  the  rounding  of  multiplications. 

In  Sec.  III-B,  we  analyzed  the  effects  of  roundoff  errors  and  the  overflow  constraint  to  obtain 
formulas  for  output  noise-to-signal  ratio  in  a  fixed  point  FFT.  We  found  that  to  guarantee  against 
overflow  (using  preset  attenuators,  and  no  overflow  test)  and  maintain  the  lowest  possible  noise- 
to-signal  ratio,  it  was  desirable  to  keep  the  input  array  as  large  as  possible  [under  the  constraint 
|x(n)|  <  1)  and  incorporate  attenuations  of  l/2  at  each  stage.  If  this  method  were  used,  the  re¬ 
sulting  prediction  for  output  noise-to-signal  ratio  (for  rounded  arithmetic)  was 


(111—46) 


Our  analysis  also  yielded  the  result  that  the  output  noise  sequence  E(k)  is  essentially  white.  An 
implication  of  the  result  (III -4 5)  is  that  if  we  double  N,  adding  another  stage  to  the  FFT,  the  rms 
output  noise-to-signal  ratio  increases  by  half  a  bit.  We  validated  (III— 45)  experimentally,  for 
both  white  and  sinusoidal  input  sequences.  We  investigated  experimentally  the  question  of  how 
our  results  would  change  if  truncation  rather  than  rounding  were  used,  and  found  that,  although 
the  actual  values  of  noise-to-signal  ratio  were  higher  for  truncation,  the  half -bit -per-stage  slope 
was  retained.  This  might  have  been  expected  since  this  slope  [the  proportionality  to  N  in  (III-45)] 
is  determined  chiefly  by  the  signal  attenuation  (same  for  rounding  and  truncation)  and  not  by  the 
noise  accumulation. 

Using  (III-45)  in  conjunction  with  ( III -44 ),  we  could  specify  register  length  requirements  for 

a  multiplier  to  be  used  in  a  fixed  point  FFT.  To  illustrate  how  this  might  be  done,  let  us  con- 

10 

sider  a  specific  example.  Suppose  the  longest  FFT  we  wished  to  compute  was  N  =  2  =  1024. 

Then,  from  (III -44 ),  we  obtain 


log2(5/3)l/Z  -tc 


1/2  log_  (<r l/o  1) 


2  '  E'  X'coefficient 


=  0.37  -t 

c 

O 

and,  from  (III-45),  assuming  cr^  =  l/3  (as  for  white  input). 


(III -46 ) 


1/2  log-  (CT  Z/a  Z) 


2  vu  E/  X) roundoff 


=  6.16  -t 


(III  —  47 ) 


r 


where  t  and  t^  represent  the  number  of  bits  retained  (not  including  sign)  in  the  coefficients  and 
rounded  results,  respectively.  Suppose  we  require  that  neither  the  error-to-signal  ratio  due  to 
coefficient  quantization  nor  the  noise-to-signal  ratio  due  to  roundoff  exceeds  —  60  dB.  This  im¬ 
plies  that  both  ( III -46 )  and  ( III -47 )  be  less  than  —10  bits,  so  that  the  minimum  register  lengths 
are  t  =  11  bits  and  t^  =  17  bits.  Thus,  if  we  include  the  sign  bits,  a  12  x  18-bit  multiplier  would 
be  required  —  12  bits  for  the  coefficient,  and  18  bits  for  the  signal.  An  18-bit  rounded  output 
would  be  needed  from  each  multiply. 
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In  Sec.  I1I-C,  we  studied  the  effects  of  roundoff  noise  in  a  block  floating  point  FFT,  where 
the  array  was  scaled  by  overflow  tests  and  right  shifts.  We  observed  that  fixed  point  computa¬ 
tion,  where  scalings  are  imposed  at  every  stage,  is  a  worst  case  (with  respect  to  noise-to-signal 
ratio)  for  block  floating  point.  We  found  that  for  inputs  with  a  fairly  flat  spectrum,  significant 
improvements  over  the  fixed  point  case  could  be  attained  in  output  noise-to-signal  ratio.  Detailed 
analysis  and  experimentation  were  carried  out  for  the  case  of  white  input  signal. 

In  Sec.  I1I-D,  a  statistical  model  for  roundoff  errors  was  used  to  predict  output  noise-to- 
signal  ratio  in  a  floating  point  FFT.  The  result,  derived  for  the  case  of  white  input  signal,  was 

cte\  -2t 

— |  ]  =  (0. 42)u2  .  (111-48) 

CTX/roundoff,  floating  point 

This  predicted  result  was  significantly  lower  than  bounds  previously  derived  on  mean-squared 

2 

output  noise-to-signal  ratio,  which  were  proportional  to  u  .  The  result  was  verified  experimen¬ 
tally  when  "randomized  rounding"  was  used.  When  either  "nonrandomized  rounding"  or  trunca¬ 
tion  was  used,  a  greater  than  linear  increase  with  u  of  the  output  noise-to-signal  ratio  was  ob¬ 
served.  This  was  attributed  to  the  fact  that  these  procedures  introduced  a  correlation  between 
the  rounding  errors  and  the  rounded  quantities.  An  implication  of  (111-48)  and  (111-14)  [the  float¬ 
ing  point  counterpart  of  (111-44)]  is  that,  for  floating  point,  the  noise-to-signal  ratios  due  to  both 

-2t 

coefficients  and  roundoffs  are  proportional  tou2  ,  so  that  it  seems  reasonable  to  retain  the 
same  number  of  bits  for  the  coefficients  as  for  rounding. 

We  concluded  Sec.  Ill  with  a  discussion  of  the  relative  accuracy  of  the  FFT  (floating  point) 
and  a  direct  Fourier  transform.  It  was  pointed  out  that  if  a  treelike  summation  is  used  to  carry 
out  the  direct  Fourier  transform,  then  (on  a  statistical  basis)  its  accuracy  is  essentially  the  same 
as  that  of  the  FFT. 
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IV.  FREQUENCY  SAMPLING  FILTERS 


2  4 

We  now  study  quantization  effects  in  frequency  sampling  filters.  ’  The  term  "frequency 

sampling  filter"  is  used  here  to  refer  to  a  particular  realization  of  a  finite  duration  impulse 

response  filter,  by  means  of  a  comb-filter  and  a  bank  of  resonators.  The  terminology  is  a  bit 

misleading,  since  "frequency  sampling"  is  actually  more  descriptive  of  the  design  procedure 

used  than  of  the  particular  implementation.  Once  designed,  the  filter  could  be  realized  in  a 

variety  of  ways,  including  fast  convolution  (using  the  FFT).  In  fact,  an  important  objective  of 

this  section  and  Sec.  V  is  to  compare,  on  the  basis  of  differing  quantization  effects,  recursive 

and  FFT  realizations  of  finite  duration  impulse  response  filters. 

32 

Gold  and  Jordan  have  shown  that  any  finite  duration  impulse  response  filter  can  be  real¬ 
ized  in  the  frequency  sampling  form,  that  is  by  means  of  a  comb-filter  and  a  bank  of  resonators. 
Consider  a  filter  whose  impulse  response  h(n)  is  time-limited  to  N  samples.  Its  transfer 
function  is 

N-l 

H*(z)=  2  h(n)z"n  .  (IV- 1 ) 

n=0 

But  the  finite  duration  h(n)  can  be  expressed  in  terms  of  its  LIFT,  H(k),  as 

N-l 

h(n)  =  2  H(k)  exp  fj(2ir/N)  nk|  .  (1V-2) 

k  0 


The  H(k)  represent  N  samples,  uniformly  spaced  around  the  unit  circle,  of  the  z-transform 
H(z).  Using  (IV- 2 )  in  (IV- 1),  we  can  express  H,;‘(z)  in  terms  of  its  frequency  samples  as 


H*(z)  = 


N-l 

V 

Lt 

k=0 


H(k ) 

I_ej2,k/Nz-1 


(IV- 3) 


-N 

This  is  basically  the  frequency  sampling  realization.  The  term  (1  —  z  )  represents  a  comb¬ 
filter,  which  has  N  zeros  equally  spaced  around  the  unit  circle.  The  sum  represents  a  bank  of 
single  pole  filters.  Each  pole  cancels  one  of  the  zeros  of  the  comb-filter,  so  that  the  composite 
filter  has  only  zeros.  The  continuous  frequency  response  H^fe^)  passes  exactly  through  its 
N  samples  H(k)  according  to 


H*  (e**w 


w=2irk/N 


=  H(k) 


(IV-  4 ) 


If  we  confine  attention  to  the  case  where  the  H(k)  are  real,  and  H(k)  =  H(N  —  k),  we  can  com¬ 
bine  complex  conjugate  terms  in  (IV- 3)  to  obtain 


H*  (z ) 


+ 


[N/2] 

E 

k  =  l 


2 H(k )  [1  -  cos  (2>rk/N) 

1  —  2  cos  (27rk/N)  z 


(IV- 5) 


where 


[N/2]  = 


N/2 

(N  -  1  )/2 


N  even 
N  odd 
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This  filter  can  be  realized  using  resonators  with  real  coefficients,  and  has  linear  phased  The 
filter  can  be  designed  by  specifying  the  frequency  samples  H(k),  or  by  any  of  the  usual  methods 
for  designing  nonrecursive  filters.  Usually,  we  would  choose  the  H(k)  to  alternate  in  sign  so 
that  the  continuous  magnitude  response  |H:<t(e''w)|  will  smoothly  interpolate  between  the  sample 
values.  Excellent  filter  designs  (band-  or  low-pass  filters  with  low  sidclobcs  and  linear  phase) 
corresponding  to  the  form  (1V-5)  can  be  obtained. 

However,  the  quantization  problems  involved  in  implementing  (IV- 5)  are  quite  severe, 
basically  due  to  the  need  for  pole-zero  cancellations  on  the  unit  circle.  Our  aim  here  is  to 
study  these  quantization  effects  so  that  register  length  requirements  can  be  decided  upon.  In 
Sec.  V,  we  shall  consider  the  quantization  problems  which  occur  when  the  same  filter  designs 
are  implemented  via  the  route  of  the  FFT. 

In  practice,  we  would  not  attempt  to  realize  exactly  the  unit  circle  pole- zero  cancellations 
implied  by  (IV- 5),  but  would  move  the  poles  of  the  resonators  and  the  zeros  of  the  comb-filter 
slightly  inside  the  unit  circle,  so  that 


H*(z)  = 


N  -N 
r  z 


N 


H(0) 


[N/2] 

+  2 

k  =  l 


2H(k)  [1  -  cos  (27rk/N)  z'1] 


1  —  2r  cos  (2»rk/N)  z”  1  +  r2z  2 


(IV- 6) 


where  r  is  chosen  slightly  less  than  1,  say, 

r  =  i-6  .  (IV-7) 

A  block  diagram  of  the  frequency  sampling  filter  is  shown  in  Fig.  18.  The  distance  6  of  the 
poles  from  the  unit  circle  is  an  important  parameter  which  must  be  chosen  by  the  designer.  As 


Fig.  18.  Black  diagram  af  frequency  sampling  filter. 

we  decrease  6,  both  roundoff  noise  and  coefficient  quantization  problems  become  more  severe. 
But  as  6  is  made  larger,  the  filter  characteristics  begin  to  deviate  more  and  more  from  those 
of  the  original  frequency  sampling  filter,  which  is  generally  designed  assuming  6  =  0. 


tActually,  precise  linear  phase  will  be  obtained  if  N  is  odd.  If  N  is  even,  only  approximate  linear  phase 
will  be  obtained,  unless  h(0)  =  0,  in  which  case  precise  linear  phase  is  obtained. 
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A.  COEFFICIENT  QUANTIZATION^ 

Coefficient  quantization  causes  the  pole  pair  of  each  resonator  in  our  frequency  sampling 
filter  to  deviate  in  position  in  the  z-plane  from  the  corresponding  zeros  of  the  comb-filter. 

Since  the  poles  and  zeros  are  very  close  to  the  unit  circle,  the  imperfect  pole-zero  cancella¬ 
tions  can  cause  severe  ripples  in  the  filter's  frequency  response.  This  can  be  compensated  for 
somewhat  by  increasing  the  parameter  <5,  to 
move  the  poles  and  zeros  further  inside  the 
unit  circle.  However,  making  6  too  large  will 
also  cause  the  frequency  response  to  change 
appreciably  from  that  of  the  original  frequency 
sampling  filter,  which  is  generally  designed 
assuming  6  =  0.  The  expected  deviation  in  pole 
positions  clearly  increases  with  decreased 
register  length,  so  that  one's  choice  of  <5 
should  depend  on  the  register  length.  Also,  the 
expected  deviation  in  pole  positions  depends  on 
the  particular  form  of  resonator  chosen  (e.g., 
coupled  or  direct).  When  resonators  of  low 
resonant  frequency  are  required,  the  lower  co¬ 
efficient  sensitivity  of  the  coupled  form  might 
induce  one  to  choose  this  form,  despite  the 
added  computation  needed  in  its  operation. 

As  we  see  from  this  discussion,  the  key 
problem  here  is  one  we  have  discussed  in  de¬ 
tail  in  a  preceding  section,  namely  the  problem 
of  coefficient  quantization  in  digital  resona¬ 
tors.  Rather  than  repeat  any  analysis,  we 

shall  illustrate  here  by  example  the  composite  c-  c  c  i  r 

*'  rig.  IV.  Frequency  response  of  low-pass  frequency 

effects  of  coefficient  quantization  on  a  compli-  sampling  filter, 

cated  frequency  sampling  filter.  We  focus  on 

the  case  of  fixed  point  coefficients,  making  only  a  brief  qualitative  comment  as  to  how  we  might 
expect  the  floating  point  case  to  compare  with  the  fixed  point  case. 

A  low-pass  filter  design,  suitable  for  implementation  via  Fig.  18  [or  ( IV -6) ],  was  borrowed 
from  Jordan  and  Gold.-*-*  Its  specifications  are:  H(k)  =  (—1)^,  k  =  0,  1,  ...  ,  31;  H(3Z)  =  0.7; 

H( 33)  =  -0.225;  H(34)  =  0.01995;  H(k)  =  0,  k  =  35,  36 _ _  N/Z;  H(k)  =  H(N  -k);  N  =  256.  Its  fre¬ 

quency  response  |H*  (e-*a>^')|  (we  recall  that  we  are  assuming  for  convenience  that  T  =  1)  is  shown 
in  Fig.  19.  The  maximum  peak-to-peak  passband  ripple,  as  defined  in  the  figure,  is  33.23  dB 
down  from  the  DC  gain,  and  the  maximum  sidelobe  is  84.57  dB  down  from  the  DC  gain.  Only  the 
envelope  of  the  sidelobes  is  shown. 

The  effects  of  coefficient  quantization  were  examined  as  follows.  A  value  of  6  was  chosen, 
and  the  filter  coefficients  were  computed  and  quantized  to  a  chosen  number  of  bits  (18  or  less). 
The  filter,  with  these  quantized  coefficients,  was  pulsed,  using  36-bit  arithmetic  to  minimize 
the  effect  of  roundoff  noise,  and  the  impulse  response  thus  obtained  was  Fourier  transformed 

t  Most  of  the  results  reported  here  have  been  previously  reported  by  the  author.^ 
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(also  36-bit  arithmetic)  to  find  the  frequency  response.  With  quantized  coefficients,  the  impulse 
responses  did  not  exactly  terminate  after  256  samples.  However,  it  was  observed  that,  after 
2048  samples,  the  impulse  response  was  zero  or  negligible,  and  this  was  the  number  of  points 
used  in  computing  the  DFT. 

For  the  first  experiment,  the  resonators  were  programmed  using  the  direct  form  recursion 
yk(n)  =-r2yk(n-  2)  +  2r  cos  yk<n  -  1)-  cos  w(n-  1)  +  w(n)  (IV-  8 ) 

for  the  resonator.  The  comb-filter  was  programmed  as 

N 

Nw(n)  =  x(n)  —  r  x(n  —  N)  .  (IV- 9  ) 

The  first  half  of  Table  II  shows  the  performance  of  this  filter  as  a  function  of  register 
length  used  for  the  coefficients  (sign  included),  compared  with  the  36-bit  "ideal"  performance. 
For  each  register  length,  a  range  of  values  of  6  was  tried  in  order  to  decide  upon  that  value 
which  yielded  performance  as  close  as  possible  to  the  ideal.  The  criteria  used  for  the  search 


TABLE  II 

PERFORMANCE  OF  FREQUENCY  SAMPLING  FILTER 

WITH  QUANTIZED  COEFFICIENTS 

Performance  with  Direct  Farm  Resonators 

Register  Length 

5 "apt" 

Maximum  Sidelobe* 

(dB) 

Maximum  Passband 
Ripple  t 
(dB) 

36 

2"17 

84.57 

33.23 

-9 

18 

2 

82.63 

30.78 

-9 

16 

2 

82.27 

18.00 

-8 

14 

2  0 

78.42 

4.23 

12 

2’7 

73.98 

7.81 

Performance  with  Coupled  Resonators 

36 

2'17 

84.57 

33.23 

-9 

18 

2 

84.04 

32.84 

-9 

16 

2 

83.84 

31.64 

14 

2'8 

80.10 

28.61 

12 

2'7 

68.41 

21.81 

*Maximum  sidelobe  =  DC  gain  +  maximum  gain  in  stopband. 

t  Maximum  passband  ripple  =  DC  gain  +  maximum  peak-ta-peak  ripple  in  passband. 
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were  the  ripple  and  sidelobe  levels,  which  were  weighed  against  each  other  somewhat  arbitrarily. 
When  two  values  of  6  yielded  equal  or  nearly  equal  performance,  the  larger  value  was  chosen 
since  this  would  lead  to  less  roundoff  noise.  The  chosen  value  of  6,  which  we  have  called  6„ 
and  the  resulting  maximum  sidelobe  and  passband  ripple,  are  shown  for  each  register  length. 

(Note  that,  according  to  our  definitions,  larger  numbers  in  Table  11  correspond  to  lower  side- 
lobes  and  ripple.)  For  18  bits,  we  see  that  reasonably  close  to  ideal  behavior  was  obtained;  but 
for  16  bits  or  less,  significant  degradations  occur.  The  largest  degradation  is  the  increased 
passband  ripple.  The  largest  ripples  were  observed  to  occur  at  low  frequencies,  which  can  be 
explained  by  the  fact  (see  Sec.  11- A )  that  for  direct  form  resonators  with  low  resonant  frequencies, 
the  angles  of  the  poles  are  very  sensitive  to  coefficient  quantization.  Thus,  the  deviation  of  the 
pole  positions  increases  both  with  decreasing  resonant  frequency  and  with  decreasing  register 
length. 

The  sensitivity  to  quantization  of  the  low-frequency  poles  (excluding  the  real- axis  pole)  can 
be  reduced  if  the  direct  form  resonator  is  replaced  by  a  coupled  form.  This  was  tried,  replacing 
(IV- 8)  by  the  coupled  first-order  equations 

yk(n)  =  r  cos  yk(n  -  1 )  -  r  sin  uk(n  -  1 )  +  w(n) 

uk(n)  =  r  sin  pp  yk<n  -  1 )  +  r  cos  pp  uk(n  -  1 )  (IV- 10) 

whose  transfer  function  Yk(z)/W(z)  is  the  same  as  in  (IV- 8)  except  for  a  very  slight  shift  in  the 
position  of  the  one  z- plane  zero.  Significant  improvements  resulted,  as  we  see  by  reference  to 
Table  11.  For  example,  using  coupled  resonators  and  16- bit  coefficients,  we  get  better  perform¬ 
ance  than  with  direct  form  resonators  and  18  bits.  Even  with  14  bits,  the  coupled  form  yields 

performance  reasonably  close  to  the  36-bit  standard. 

We  would  expect  results  similar  to  these  to  occur  for  any  low-pass  frequency  sampling  filter, 
and  for  such  a  filter  we  might  well  consider  using  the  coupled  form,  even  though  execution  of 
(IV -10)  requires  more  multiplications  than  (1V-8).  For  a  band-pass  filter,  direct  form  resona¬ 
tors  should  probably  be  used. 

A  review  of  Sec.  11- B  would  seem  to  indicate  that,  for  floating  point,  the  comparison  between 
frequency  sampling  filters  using  direct  and  coupled  form  resonators  would  be  qualitatively  quite 
similar  to  the  fixed  point  case.  To  find  out  the  details  of  the  effects  of  coefficient  quantization 
on  a  floating  point  frequency  sampling  filter,  we  would  probably  have  to  resort  to  experiments 
similar  to  the  above. 

B.  ROUNDOFF  NOISE  -  FIXED  POINT 

The  analysis  of  roundoff  noise  effects  in  frequency  sampling  filters  is  a  straightforward  ap¬ 
plication  of  the  techniques  of  preceding  sections.  Of  particular  relevance  is  Sec.  ll-C-3,  where 
noise  formulas  for  digital  resonators  are  given.  The  output  noise- to- signal  ratio  in  a  frequency 
sampling  filter  is  strongly  dependent  on  the  parameter  6,  the  distance  of  the  resonator  poles 
from  the  unit  circle.  In  general,  as  6  is  made  smaller,  the  output  noise- to- signal  ratio  increases, 
so  that  we  would  want  to  choose  the  largest  possible  6  while  retaining  within  some  required 
tolerance  the  characteristics  of  the  originally  designed  filter.  This  issue  was  considered  in 
Sec.  A  above,  and  was  taken  into  account  in  choosing  the  values  ^n0ptn  in  Table  II.  The  effect 
of  roundoff  noise  will  also  depend  on  what  realization  form  is  chosen  for  the  digital  resonators. 
Here,  we  shall  consider  both  the  direct  and  the  coupled  forms.  For  frequency  sampling  filters, 
we  intend  to  study  the  roundoff  noise  problem  in  detail  only  for  the  fixed  point  case. 
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Let  us  refer  to  the  block  diagram  of  Fig.  18,  in  order  to  point  out  the  noise  sources  in  a 
frequency  sampling  filter.  The  output  of  the  comb-filter  is  corrupted  by  a  small  amount  of 
white  noise  due  to  the  multiplication  by  r  .  The  attenuator  1/N  reduces  this  comb-filter  noise 
to  a  negligible  level  compared  with  the  noise  due  to  roundoff  of  the  multiplication  by  the  atten¬ 
uator.  Thus,  at  the  input  to  the  resonator  bank,  we  have  the  signal  w(n)  plus  white  noise  e  (n) 

2  -  Zt  1 

of  variance  oq  =  2  /12  (assuming  rounded  arithmetic).  e^(n)  propagates  through  the  resonator 

bank  to  form  one  component  of  the  output  noise.  The  remainder  of  the  output  noise  is  the  sum 
of  the  noises  at  the  resonator  outputs,  due  to  roundoffs  in  the  resonators.  In  computing  the  out¬ 
put  noise  variance,  the  only  new  calculation  needed  is  the  calculation  of  the  effect  of  e^n),  since 
we  can  use  previous  formulas  for  the  noises  due  to  the  resonators. 

Let  us  denote  by  e,(n)  the  part  of  the  output  noise  due  to  propagation  of  e  (n)  through  the 

1  2  2  1 

filter  bank,  and  denote  the  variance  of  e,(n)  by  cr.  .  Exact  calculation  of  cr.  is  laborious,  since 

1  e1  ed 

there  is  overlap  between  the  frequency  responses  of  the  filters  in  the  bank.  However,  since  the 
resonators  have  very  narrow  bandwidth,  it  seems  reasonable  to  neglect  this  overlap.  Then,  us¬ 
ing  high- gain  approximations  for  the  variance  at  each  resonator  output,  and  adding  variances, 
we  obtain 


[N/2] 

1/2H2(0)+  Yj  N2(k) 
k=l 


(lV-il) 


This  formula  holds  whether  direct  or  coupled  form  resonators  are  used.  The  approximation  is 
valid  providing  the  bandwidth  (approximately  26)  of  each  resonator  is  sufficiently  small  compared 
with  the  angular  spacing  2tr/N  between  adjacent  resonators.  Recall  that  t  denotes  the  fixed  point 
word  length  (excluding  sign). 

Let  us  denote  the  total  output  noise  due  to  roundoffs  in  the  resonators  by  e, (n),  with  variance 

2  ^ 

<j  which  will  depend  on  the  form  chosen  for  the  resonators. 

e2 

If  the  direct  form  is  used,  then  [see  (11-30)] 


<7 


2 

e2 


-2t\ 


12 


r  [n/2] 

)<£> 

1/2I12(0)+  3^ 

II  (k) 

k=l 

sin  (27rk/N) 

(IV-  12  ) 


[We  have  assumed  that  the  factors  of  2  shown  in  the  resonator  transfer  functions  of  Fig.  18  are 
lumped  with  the  multiplications  by  H(k).] 

Thus,  for  the  case  of  direct  form  resonators,  the  total  output  noise  variance  is 


-2t\ 


12 


[N/2] 

[N/2]  ?  1 

) 

H2(0)  +  Y 

H2(k)  +  $Y  "  <k) 

/  o 

k  =  l 

sin^  (27rk/N) 

(IV-  1 3 ) 


For  the  case  of  a  low-pass  filter,  where  the  first  few  H(k)  are  in  the  passband,  the  first  few 

2  2 

terms  in  the  last  sum  above  will  comprise  most  of  <7^,  due  to  the  factors  l/sin  (27rk/N)  which 
increase  sharply  for  decreased  k. 

For  the  case  of  coupled  form  resonators,  we  have,  using  (11-43), 


/  -?t\ 

[N/2] 

(V)  <i» 

1/2H2(0)+  4^  H2(k) 

L  k=l 

(IV-  14) 
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and  adding  (lV-ll)and  (IV- 14) 


a 


2 

e 


[N/2] 

H2(0)  +  5V  II2(k) 
k=l 


(IV-  15) 


There  are  no  longer  any  terms  with  a  l/sin  (2jrk/N)  dependence,  and  thus  for  low- pass  frequency 
sampling  filters  there  will  generally  be  less  total  noise  variance  with  coupled  form  resonators 
than  with  direct  form  resonators. 

We  have  not  directly  verified  experimentally  our  noise  formulas  for  frequency  sampling 

20-22 

filters.  However,  based  on  the  measurements  performed  by  others  on  noise  in  resonators, 

we  can  have  reasonable  confidence  in  our  models.  Actual  measurement  of  noise  variance  in  a 
frequency  sampling  filter  is  a  difficult  task,  since  the  resonators  are  very  lightly  damped,  yield¬ 
ing  output  noise  with  high  correlation  from  sample- to- sample.  Measurement  times  needed  to 
obtain  a  consistent  estimate  of  the  mean- squared  output  noise  are  prohibitive  in  such  a  case. 

Let  us  compare  (IV-13)  and  (IV-15)  numerically,  for  our  low-pass  filter  example  of  the  pre¬ 
ceding  section.  [  II(k  )  =  (-  1  )k,  k  =  0,  1 . 31;  11(32)  =  0.7;  11(33)  =- 0.225;  11(34  )  =  0. 01  995; 

II(k)  =  0,  k  =  35,  36,  ....  N/2;  N  =  256.]  We  assume  <5  =  2^,  the  choice  specified  in  Table  II  for 
18-bit  coefficients.  For  direct  form  resonators,  (IV-13)  yields 

cr  2  =  (345295  )2-  2t  (IV-l6a) 

or,  in  terms  of  bits  of  rms  output  noise, 

1/2  log2  (22tu2)  =  9.20  bits  .  (IV-l6b) 

Due  to  the  l/sin  (2trk/N)  factors,  the  first  three  terms  of  the  last  sum  in  (IV-13)  contribute 
83.8  percent  of  the  noise  variance  given  in  (IV-l6a).  For  the  case  of  coupled  form  resonators. 
(IV-  1 5 )  yields 

a2  =  (6771  )2"  2t  (IV-  17a) 

or  in  terms  of  bits, 

1/2  log2  (22tu2)  =  6.36  bits  .  (IV-  17b ) 


Thus,  for  this  particular  filter,  the  coupled  form  yields  a  decided  advantage,  about  3  bits  lower 
rms  output  noise.  However,  for  a  filter  of  a  band-pass  nature,  we  would  probably  wish  to  use 
direct  form  resonators.  For  a  band-pass  filter,  the  noise  produced  using  the  direct  form  would 
be  approximately  the  same  as  with  the  coupled  form,  and  the  direct  form  would  have  the  advan¬ 
tage  of  increased  computational  speed. 

In  the  above  comparison,  we  have  not  mentioned  the  overflow  constraint.  However,  for  the 
case  of  a  frequency  sampling  filter,  the  output  signal  power  that  can  be  attained  while  satisfying 
the  overflow  constraint  is  essentially  the  same  for  direct  and  coupled  form  resonators,  so  that 
the  comparison  above  is  fair. 

To  prevent  overflow  in  a  frequency  sampling  filter,  we  require  that  (1 )  the  comb-filter  does 
not  overflow,  and  (2)  that  none  of  the  resonators  overflow.  If  these  requirements  are  satisfied, 
then  any  overflow  caused  by  summing  the  outputs  of  the  resonators  can  be  prevented  by  attenuators 
at  the  resonator  outputs.  Since  these  final  attenuators  act  on  both  signal  and  noise,  they  would 
not  affect  the  output  noise- to- signal  ratio. 
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For  the  comb- filter,  we  have  2  |h(n)|  <:  2,  so  that  overflow  can  be  prevented  by  restrict- 

n=0 

ing  | x (n ) |  <  l/2.  To  prevent  overflow  in  the  resonators,  we  insert  an  attenuator  between  the 
comb-filter  and  resonator  bank,  as  indicated  in  Fig.  18.  Without  the  attenuator,  the  k1^  comb¬ 
filter-resonator  combination  has  z-transform 


H  (z) 
cr 


and  impulse  response 


h  (n)  = 
cr 


(1  —  r^z-^')  [1  —  r  cos  (27rk/N)  z  *] 

1  —  2r  cos  (2Trk/N)  z  *  +  r2z  2 


n  ,  2  jrkn  . 
r  cos  (  N  ) 


n  =  0,  1 ,  .  .  .  ,  N  -  1 

otherwise 


It  can  easily  be  verified  that 


(IV- 18) 


T,  |hcr(n)|<N  (IV- 19) 

n=0 


so  that  with  |x(n)|  <  l/2,  an  attenuator  2/N  will  prevent  overflow  in  the  resonator.  Another 
way  of  formulating  the  condition  for  no  overflow  in  a  frequency  sampling  filter,  is  to  revise 
Fig.  18  so  that  an  attenuator  l/2  precedes  the  comb-filter,  and  an  attenuator  2/N  follows  the 
comb-filter.  Then,  our  restriction  on  the  input  signal  is  simply  |x(n)|  <  1. 

Given  these  conditions,  we  can  obtain  expressions  for  output  noise- to- signal  ratio  in  our 
filter.  As  usual,  the  noise- to- signal  ratio  depends  on  the  type  of  input  signal  used.  We  con¬ 
sider  a  white  input,  with  first-order  probability  density  function  uniform  in  (—1,  1),  so  that 
crx  =  l/3.  The  output  signal  variance  can  be  expressed  as 


a 


2 

y 


2  (N/2]  2 

ir(0)  +  22  11  (k) 

_ k^l _ 

N 


(IV-  20 ) 


For  the  filter  example  we  have  been  considering. 


2  64  2  _ 

y  256  CTx  12 


o 


(IV-21 ) 


Since  the  filter  is  fairly  wideband,  a  sizable  portion  of  the  input  signal  power  passes  through  the 
filter.  These  formulas  are  valid  for  both  direct  and  coupled  form  resonators,  and  can  be  used 
in  conjunction  with  the  above  results  for  output  noise  variance  to  yield  formulas  for  output  noise- 
to- signal  ratio. 

By  using  ( 1 V- 1 6 )  and  (IV-21  ),  the  rms  output  noise- to- signal  ratio,  for  direct  form  resona¬ 
tors,  is  (expressed  in  units  of  bits) 


1/2  log2 


10.99  bits 


( IV—  22) 


By  using  (IV- 17)  and  (IV-21),  the  corresponding  result  for  coupled  form  resonators  is 


1/2  log2 


=  8.15  bits 


(IV-  2  3 ) 
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C.  SUMMARY  OF  QUANTIZATION  EFFECTS  IN  FREQUENCY  SAMPLING  FILTERS 

A  frequency  sampling  filter  is  a  realization  of  a  finite  duration  impulse  response  filter  by 
means  of  a  comb-filter  and  a  bank  of  resonators.  This  realization  technique  is  quite  flexible, 
since  any  finite  duration  impulse  response  filter  can  be  synthesized  in  this  way.  The  design  of 
such  a  filter  can  be  accomplished  by  specifying  samples  of  the  filter's  frequency  response; 
these  samples  correspond  to  constant  multipliers  to  be  applied  to  the  outputs  of  the  resonators. 
For  a  band-  or  low-pass  filter,  where  the  out-of-band  frequency  samples  are  specified  to  be 
zero,  the  number  of  resonators  needed  may  be  modest,  and  thus  the  amount  of  computation 
time  needed  to  compute  each  output  point  may  compare  favorably  with  the  time  needed  in  a  direct 
convolution,  or  in  an  FFT  realization.  The  frequency  sampling  realization  can  also  be  used  to 
synthesize  a  bank  of  band-pass  filters.  One  comb-filter  can  be  shared  by  all  the  resonators  in 
the  bank. 

Quantization  problems  in  a  frequency  sampling  filter  are  quite  severe,  due  to  the  need  for 
pole- zero  cancellations  near  the  unit  circle.  Because  of  coefficient  quantization,  the  pole- zero 
cancellations  will  be  imperfect.  This  can  lead,  for  example,  to  large  passband  ripples  in  the 
filter's  frequency  response.  For  a  particular  low-pass  filter  example,  experiments  have  indi¬ 
cated  that,  with  direct  form  resonators,  16-bit  coefficients  (fixed  point)  are  needed  to  keep  pass- 
band  ripple  tolerably  low.  If  the  resonators  are  realized  using  the  coupled  form,  so  that  the 
sensitivity  to  coefficiency  errors  of  the  resonant  frequency  of  the  low- resonant- frequency  poles 
is  reduced,  then  14  bits  seem  sufficient. 

Since  high-gain  resonators  are  required  in  the  frequency  sampling  realization,  the  effects 
of  roundoff  noise  and  the  overflow  constraint  are  also  quite  severe.  Thus,  quite  accurate 
arithmetic  would  be  needed  to  maintain  a  suitably  low  noise- to- signal  ratio.  For  the  fixed  point 
case,  a  noise-to-signal  ratio  analysis  was  carried  out,  and  numerical  results  were  obtained  for 
our  low-pass  filter  example.  Expressing  the  results  in  decibels,  we  have  from  (IV-22), 


( IV  —24) 


if  direct  form  resonators  are  used.  To  maintain  N/S<  — 60dB,  we  would  need  a  word  length 
t  =  21  bits.  For  coupled  form  resonators,  the  situation  is  somewhat  improved,  as  from  ( IV- 2  3) 
we  have 


(1V-25) 


so  that  to  maintain  a  —  60-dB  N/S,  t  =  18  bits  would  approximately  suffice. 

We  should  comment  that  the  advantage  of  coupled  form  resonators,  with  respect  to  both 
coefficient  sensitivity  and  noise-to-signal  ratio,  holds  only  for  low-pass  filters.  For  a  band¬ 
pass  filter,  the  effects  of  quantization  for  direct  and  coupled  form  resonators  are  quite  similar, 
and  we  would  probably  use  the  direct  form  to  obtain  greater  computational  speed. 
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V.  FFT  FILTERS 


In  this  section,  wc  consider  quantization  effects  in  the  class  of  digital  filters  which  have  an 
impulse  response  of  finite  duration  and  are  implemented  by  means  of  convolution  sums  performed 
using  the  discrete  Fourier  transform  (DFT).  The  output  samples  of  such  a  filter  are  obtained 
from  the  results  of  N-point  circular  convolutions  of  the  filter  impulse  response  (kernel)  with  sec¬ 
tions  of  the  input.  These  circular  convolutions  arc  obtained  by  computing  the  DFT  of  the  input 

section,  multiplying  by  the  DFT  of  the  impulse  response,  and  inverse  transforming  the  results. 

2  7 

Stockham  '  has  discussed  procedures  for  utilizing  the  results  of  these  circular  convolutions  to 
perform  linear  convolutions,  methods  for  piecing  together  sections  of  output,  rationales  for 
choosing  the  transform  length  N,  and  speed  advantages  over  direct  convolution  to  be  gained  by 
using  the  FFT  to  implement  the  required  DFT's.  We  shall  not  discuss  the  details  of  these  issues, 
but  concern  ourselves  with  the  effects  of  quantization  on  the  basic  computation,  involving  an 
FFT  of  input  section,  multiplication  by  DFT  of  kernel,  and  inverse  FFT.  The  additional  error 
caused  by  piecing  sections  together  will  generally  be  negligible  compared  with  the  errors  due  to 
the  FFT-multiply-IFFT  (inverse  fast  Fourier  transform).  We  shall  restrict  attention  to  radix 
2  FFT  algorithms. 

Any  filter  with  finite  duration  impulse  response  can  be  implemented  via  the  route  of  the  FFT, 
as  well  as  by  the  frequency  sampling  technique  given  in  Sec.  IV.  In  working  out  numerical  re¬ 
sults  in  this  section,  we  shall  use  the  same  prototype  filter  used  in  Sec.  IV.  This  filter  has  an 
impulse  response  of  duration  2  56  samples,  and  we  shall  assume  in  its  FFT  implementation  that 
512-point  circular  convolutions  are  used,  so  that  an  input  section  might  consist  of  256  input 
values  augmented  by  256  zeros. 

We  shall  draw  heavily  in  this  section  on  the  results  presented  in  Sec.  Ill  on  quantization  in 
the  FFT.  A  few  of  the  issues,  such  as  the  arranging  of  scaling  to  prevent  overflow,  change 
somewhat  for  the  case  of  an  FFT  filter. 

Most  of  our  attention  here  will  be  confined  to  the  fixed  point  case.  To  detail  the  way  in 
which  matters  change  for  a  block  floating  or  floating  point  realization  would  be  to  repeat  a  great 
deal  of  our  preceding  discussion  on  the  FFT. 

A.  COEFFICIENT  QUANTIZATION 

Because  of  coefficient  quantization,  the  impulse  response  and  frequency  response  of  an 
FFT  filter  will  deviate  from  the  specified  design.  Analytical  treatment  of  this  problem  presents 
many  difficulties.  The  analysis  in  Sec.  III-A,  which  dealt  with  mean-squared  error  due  to  coef¬ 
ficient  quantization  in  an  FFT,  is  not  particularly  useful  for  this  problem.  We  would  be  more 
interested  in  how  quantities  such  as  the  maximum  passband  ripple  and  maximum  sidelobe  depend 
on  coefficient  quantization,  than  in  a  mean-squared  error  criterion.  We  have  not  been  able  to 
analyze  theoretically  these  effects.  One  of  the  problems  is  that  the  impulse  response  of  an  FFT 
filter  (with  quantized  coefficients)  is  not  well  defined,  for  the  response  of  an  FFT  filter  will,  in 
general,  change  (not  merely  by  a  time  translation)  if  the  position  of  the  input  pulse  is  shifted. 
When  the  impulse  is  shifted,  different  coefficients,  with  different  quantization  errors,  will  be 
used  in  computing  the  FFT. 

For  these  reasons,  we  have  used  an  empirical  approach  in  estimating  the  effects  of  coeffi¬ 
cient  quantization  on  an  FFT  filter.  The  low-pass  filter  design  specified  in  Sec.  IV -A  was  used 
as  a  prototype.  The  512-point  FFT  of  an  impulse  input  was  computed,  using  coefficients 
rounded  to  a  chosen  number  of  bits  M  (18  or  less)  and  36-bit  arithmetic  to  minimize  the  effect 
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of  roundoff  noise.  The  resulting  sequence  was  multiplied  by  a  512-point  DFT  of  the  filter  im¬ 
pulse  response.  In  obtaining  this  filter  function,  the  256-point  filter  impulse  response  was  ob¬ 
tained  by  inverse  transforming  the  frequency  samples  specified  in  Sec.  IV -A;  256  zeros  were 
appended,  and  a  512-point  DFT  was  computed  (36-bit  arithmetic  and  coefficients  were  used  in 
these  computations,  but  the  resulting  512-point  sequence  was  rounded  to  M  bits).  The  product 
sequence  (DFT  of  input  section  times  DFT  of  filter)  was  then  inverse  transformed  using  M-bit 
coefficients  and  36-bit  arithmetic.  This  yielded  a  filter  impulse  response,  corrupted  due  to 
coefficient  quantization.  To  obtain  a  frequency  response,  this  impulse  response  was  lengthened 
to  2048  points  by  appending  zeros,  and  a  2048-point  |DFT|  was  computed  using  36-bit  arithme¬ 
tic  and  coefficients.  All  the  above  arithmetic  was  fixed  point. 

As  a  control,  M  was  initially  set  to  36  bits,  and  the  "ideal"  frequency  response  specified 
in  Fig.  19  was  obtained.  As  M  was  reduced,  measurements  showed  the  maximum  sidelobe  and 
passband  ripple  to  increase.  Scope  pictures  showed  that  the  degradation  in  frequency  response 
appeared  in  the  form  of  noise  distributed  fairly  uniformly  over  the  entire  frequency  axis.  This 
was  in  contrast  to  the  results  for  frequency  sampling  filters,  where  spurious  ripples  in  the  fre¬ 
quency  response  clustered  near  the  frequencies  of  a  few  uncanceled  resonator  poles. 

The  experimental  results  are  summarized  in  Table  III,  which  show  maximum  sidelobe  and 
passband  ripple  as  a  function  of  coefficient  register  length  (including  sign).  Recall  that  larger 
numbers  in  the  table  correspond  to  lower  sidelobes  and  ripple.  For  each  register  length,  sev¬ 
eral  measurements  of  sidelobe  and  ripple  were  made,  corresponding  to  different  positions  (in 
time)  of  the  input  impulse.  Only  slight  variations  of  sidelobe  and  ripple  with  impulse  position 
were  observed,  and  the  results  shown  in  the  table  represent  an  average  over  the  measurements 
made. 

We  can  compare  Table  III  with  the  corresponding  results  given  in  Table  II  for  the  frequency 
sampling  realization.  With  respect  to  passband  ripple,  the  FFT  filter  generally  gives  better 
performance  for  equal  register  length.  For  the  FFT  realization,  we  do  not  have  the  problem 
of  uncanceled  poles  in  the  passband,  and  the  register  length  must  be  reduced  quite  drastically 
before  the  uniform  noise  in  the  filter's  frequency  response  becomes  comparable  in  level  to  the 
passband  ripples  which  were  already  present  in  the  originally  designed  filter.  However,  the 
frequency  sampling  realization  generally  has  somewhat  better  sidelobe  rejection,  for  equal  reg¬ 
ister  length.  A  partial  explanation  for  this  is  that  in  a  frequency  sampling  realization  with  re¬ 
duced  register  length,  the  zeros  of  the  comb-filter  can  still  be  kept  fairly  close  to  the  unit  circle, 
maintaining  the  stopband  attenuation.  In  the  FFT  filter,  the  noise  level  in  the  stopband  of  the 
frequency  response  increases  fairly  uniformly  with  decreased  register  length. 

B.  A  BOUND  ON  THE  OUTPUT  OF  A  CIRCULAR  CONVOLUTION1 


Here,  we  derive  a  bound  on  the  output  of  an  N-point  circular  convolution.  This  bound  is 
useful  in  arranging  scaling  of  the  array  to  prevent  overflow  in  an  FFT  filter.  The  bound  is  de¬ 
rived  for  general  filter  characteristics,  and  thus  would  be  useful,  for  example,  in  setting  up  a 
program  or  hardware  to  implement  an  FFT  filter  with  arbitrary  frequency  response  character¬ 
istics.  For  specific  filter  characteristics,  bounds  can  be  derived  which  are  lower  than  the  gen¬ 
eral  bound.  In  addition  to  the  general  bound,  we  shall  evaluate  a  specific  bound  for  the  low-pass 
filter  example  of  the  preceding  sections.  In  deriving  the  general  bound,  the  style  of  this  section 
will  become  somewhat  more  mathematical  than  that  of  the  remainder  of  the  report. 


jMuch  of  this  work  hos  been  reported  on  previously  by  Oppenheim  and  the  author. 
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TABLE  III 

PERFORMANCE  OF  FFT  FILTER  WITH  QUANTIZED  COEFFICIENTS 

Register  Length 

Maximum  Sidelobe 
(dB) 

Maximum  Passband 
Ripple 
(dB) 

36 

84.57 

33.23 

18 

83.45 

33.23 

16 

80.00 

33.22 

14 

79.30 

33.22 

12 

66.44 

33.21 

10 

55.  34 

33.20 

8 

42.31 

29.46 

1.  Problem  Statement 

According  to  the  above  discussion,  we  would  like  to  determine  an  upper  bound  on  the  maxi¬ 
mum  modulus  of  an  output  value  that  can  result  from  an  N-point  circular  convolution.  With 
{x(n)}  denoting  the  input  sequence,  {h(n)}  denoting  the  kernel,  and  {y(n)}  denoting  the  output 
sequence,  we  have 

N-l 

y(n)  =  V  x(k)  h[(n  —  k)  mod  N]  n  =  0,  1, .  .  .  ,  N  —  1  ( V — 1 ) 

k=0 


where  it  is  understood  that,  in  general,  each  of  the  three  sequences  may  be  complex.  The  cir¬ 
cular  convolution  is  accomplished  by  forming  the  product 

Y (k)  =  H(k)  X(k)  (V-2) 

where 


N-l 

X(k)  =  ±  £  x(n)Wnk  k  =  0,  1,  .  .  .  ,N  —  1  ( V  —  3 ) 

n=0 


N-l 

Y(k)=j|  £  y(n)Wnk  k=  0,1 . N-l  (V-4) 

n=0 


and 


N-l 

H(k)  =  £  h(n)  Wnk  k  =  0, 1 . N-  1  (V-5) 

n=0 


with  W  defined  as  W  = 


e-j2*/N 
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We  imagine  the  computation  to  be  carried  out  on  fixed  point  fractions.  Thus,  we  bound  the 
input  values  so  that 

|x(n)|<l  .  (V— 6 ) 

By  virtue  of  (V-3),  we  are  then  assured  that 
|X(k)|  4?  1 

so  that  the  values  of  X(k)  do  not  overflow  in  the  fixed  point  work.  If  we  are  dealing  with  a  radix 
2  FFT  algorithm,  the  factor  l/N  in  (V-3)  can  be  incorporated  as  scalings  by  i/2  at  each  array. 
As  discussed  in  Sec.  III-C,  these  scalings  will  guarantee  no  overflows  in  the  intermediate  arrays 
and  will  keep  the  signal  level  as  high  as  possible  as  we  proceed  through  the  computation. 

In  the  typical  case,  the  sequence  h(n)  is  known  and,  consequently,  so  is  the  sequence  II(k). 
Therefore,  it  is  not  necessary  to  continually  evaluate  (V-5);  that  is,  the  sequence  H(k)  is  com¬ 
puted,  normalized,  and  stored  in  advance.  Thus,  it  is  reasonable  to  only  apply  a  normalization 
to  H(k)  and  not  to  h(n),  so  that  we  require  that 


I H  (k)  |  «  1 


(V-7) 


A  normalization  of  the  transform  of  the  kernel  so  that  the  maximum  modulus  is  unity  allows 
maximum  energy  transfer  through  the  filter,  consistent  with  the  requirement  that  Y(k)  does  not 
overflow  the  register  length. 

Our  objective  is  to  obtain  an  upper  bound  on  |y(n)|  for  all  sequences  {x(n)}  and  { H (k)}  con¬ 
sistent  with  (V-6)  and  (V-7).  This  bound  will  specify,  for  example,  the  scaling  factor  to  be  ap¬ 
plied  in  computing  the  inverse  of  (V-4)  to  guarantee  that  no  value  of  y(n)  overflows  the  fixed 
point  word.  The  results  which  we  will  obtain  are  that: 

(a)  With  the  above  constraints,  the  result  of  the  N-point  circular  convolution 
of  (V-l)  is  bounded  by  |y(n)|  ■$  n/N. 

(b)  In  the  general  case,  where  {x(n)}  and  {h(n)}  are  allowed  to  be  complex, 
the  bound  in  A  is  a  least  upper  bound.  This  will  be  shown  by  demon¬ 
strating  a  sequence  that  can  achieve  the  bound. 

(c)  If  we  restrict  {x(n)}  and/or  {h(n)}  to  be  real-valued,  the  bound  of  A  is  no 
longer  a  least  upper  bound  for  every  N.  However,  the  least  upper  bound 
/3(N)  is  itself  bounded  by  Vn/2  <  0(N)  <n/N. 


2.  Derivation  of  Results 
Proof  of  (a) 

Parseval's  relation  requires  that 


and 


N-l 

z 

|y(n)|2  =  N 

N-l 

z 

|Y(k)|2 

(V-8) 

n- 0 

k=0 

N-l 

Z 

|x(n)|2  =  N 

N-l 

z 

|X(k)j2  . 

(V-9) 

n=0 

k=0 
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By  substituting  (V-2)  into  (V-8),  and  using  (V-7), 


N-l  N-l 

I  |y(n)|2<N  V  |X(k)|2 
n-0  k=  0 

or,  using  (V-9), 

N-i  N-l 

Y  ly(n)|2<  Yi  lx<n>|2 

n=0  n=0 

with  equality,  if  and  only  if  |  H(k)  ]  =  1.  However,  (V-6)  requires  that 
N-l 

Y  |x(n)|2^N 

n=0 

with  equality,  if  and  only  if  |x(n)|  =  1.  By  combining  (V-ll)  and  (V-12), 
N-l 

Y  |y(n)|2<N  . 

n=0 

But, 

N-l 

|y(n)|2^  Y  ly(°)l2 

n=0 


(V-10) 


(V-ll) 


(V-12) 


(V-l  3) 


(V-l  4 ) 


and,  therefore, 

\y(n)\4’JN  .  (V-l  5) 

Proof  of  (b) 

To  show  that  x/TT  is  a  least  upper  bound  on  |y(n)|  ,  we  review  the  conditions  for  equality  in 
the  inequalities  used  above.  We  observe  that  for  equality  to  be  satisfied  in  (V-l  5),  it  must  be 
satisfied  in  (V-ll),  (V-12),  and  (V-14),  requiring  that 

(1)  |  H  (k )  |  =  1 

(2)  |  x(n) |  =  1 

(3)  Any  output  sequence  y(n)  which  has  a  point  whose  modulus  is  equal  to 
can  contain  only  one  nonzero  point. 

The  third  requirement  can  be  rephrased  as  a  requirement  on  the  input  sequence,  and  on  the 
sequence  H(k).  Specifically,  if  the  output  sequence  contains  only  one  nonzero  point,  then  Y(k) 
for  this  sequence  must  be  of  the  form 

n  k  _ 

Y(k)  =  AW  °  =  1  A  |  exp[-j<^  nQk  +  p)] 


where  p  is  a  real  constant  and  n^  is  an  integer  so  that,  from  (V-2), 
H(k)  X(k)  =  |  A  |  exp[-j(^  nQk  +  p)] 


(V-16) 
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We  can  express  H(k)  and  X(k)  as 


and 


H(k)  =  e 


^k 


(V-17a) 


i0, 

X(k)  =  |X(k)  |  e 

where  we  have  used  the  fact  that  | H(k) |  =  1.  For  (V-l  6)  to  be  satisfied,  then 
|X(k)|  =  |  A  | 

and 

_  2lr  i 

9  ,  =  —  0,  -  -r^nk— p 

'  k  k  N  o  r 


(V-l 7b) 


(V-18) 


Therefore,  requirement  (3)  can  be  replaced  by  the  statement  that: 

(3)'  | X(k)  |  =  constant  and  the  phase  of  H(k)  be  chosen  to  satisfy  (V-18). 

As  an  additional  observation,  we  note  that  for  any  input  sequence  {x(n)}. 


N-l 

I y(n) |  <  Yj  lH<k>l  lX<k>! 
k=0 


with  equality  for  some  value  of  n,  if  and  only  if  |  H(k)  |  =  1  and  the  phase  of  H(k)  is  chosen  on  the 
basis  of  (V-18).  Therefore,  for  any  x(n)  the  output  modulus  is  maximized  when  H(k)  is  chosen 
in  this  manner.  However,  this  maximum  value  will  only  equal  n/N  if,  in  addition,  |x(n)|  =  1  and 
|  X(k)  |  =  constant. 

For  N  even,  a  sequence  having  the  property  that  |x(n)|  =  1  and  |X(k)|  =  constant  is 


x(n>  =  e-i<"2/N>  -  W* 


For  N  odd,  a  sequence  with  |x(n)|  =  1  and  j  X (k)  j  =  constant  is^ 


(V-l  9) 


x(n)  =  e_;>(2,rn  /N)  =  Wn  .  (V-20) 

39  k 

By  using  one  of  these  sequences  as  the  input,  and  choosing  H(k)  =  e  ,  with  9  ^  given  by  (V-18), 
equality  in  (V-l  5)  can  be  achieved  for  any  N.  Thus,  the  bound  given  in  statement  (a)  is  a  least 
upper  bound. 


Proof  of  (c) 

30 

Proof  of  (c)  is  omitted  here.  See  Oppenheim  and  Weinstein. 


3.  Application  of  General  Bounds 

In  a  fixed  point  FFT  filter,  we  wish  to  incorporate  scaling  in  the  computation  in  such  a  way 
that  we  are  guaranteed  never  to  overflow.  When  we  use  a  power  of  two  FFT  algorithm,  over¬ 
flows  in  the  computation  of  X(k)  will  be  prevented  by  including  a  scaling  of  1  / 2  at  each  stage, 
since  the  maximum  modulus  of  an  array  in  the  computation  is  nondecreasing  and  increases  by 

t  The  sequences  (V-l 9)  and  (V-20)  were  suggested  by  C.M.  Rader  of  Lincoln  Laboratory.  For  proof  of  their 
stated  properties,  see  Oppenheim  and  Weinstein.^ 
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at  most  a  factor  of  two  as  we  pass  from  stage  to  stage  (see  Sec.  Ill— A ) .  With  this  scaling,  the 
bound  just  derived  guarantees  that  scaling  is  not  required  in  more  than  half  the  arrays  of  the  in¬ 
verse  FFT  computation.  Therefore,  including  a  scaling  of  l/2  in  the  first  half  of  the  stages  in 
the  inverse  FFT  will  guarantee  that  there  are  no  overflows  in  the  remainder  of  the  computation. 
For  general  filter  characteristics,  no  less  scaling  will  suffice  to  guarantee  against  overflow,  and 
no  more  scaling  is  necessary.  The  fact  (c)  in  Sec.  2  above,  that  ft  ^  N/2,  indicates  that  if  we 
restrict  ourselves  to  real  input  data  or  if  h(n)  is  real,  at  most  one  rescaling  could  be  eliminated. 

The  bound  just  obtained  is  also  useful  if  the  FFT's  are  carried  out  using  a  block  floating 
point  strategy.  In  this  case,  arrays  are  scaled  by  a  factor  of  1  / 2  only  when  overflows  occur, 
but  then  a  final  rescaling  must  be  carried  out  after  each  section  is  processed  so  that  it  is  com¬ 
patible  with  the  results  from  previous  sections.  For  general  input  filter  characteristics,  the 
final  rescaling  can  be  chosen  based  on  the  bounds  given  here,  to  insure  that  the  output  will  not 
exceed  the  available  register  length. 


4.  Bounds  for  Specific  Filter  Characteristics 

The  bounds  derived  and  method  of  scaling  mentioned  above,  apply  to  the  general  case.  Ex¬ 
cept  for  the  normalization  of  (V-7),  they  do  not  depend  on  the  filter  characteristics.  This  is 
useful  when  we  wish  to  fix  the  scaling  strategy  without  reference  to  any  particular  filter.  For 
specific  filter  characteristics,  the  bound  can  be  reduced.  Specifically,  it  can  be  verified  from 
(V-l)  and  (V-6)  that,  in  terms  of  h(n), 

M-l 

|y(n)|<  £  |h(i)  |  (V-21 ) 

i-0 


where  M  denotes  the  length  of  the  impulse  response.  This  is  a  least  upper  bound  since  a  se¬ 
quence  {x(n)}  can  be  selected  which  will  result  in  this  value  in  the  output.  This  will  be  signifi¬ 
cantly  lower  than  the  bound  represented  in  (V-l 5)  if,  for  example,  the  filter  is  very  narrow 
band,  or  if  the  kernel  has  many  points  with  zero  value. 

The  bound  (V-21)  has  been  evaluated  for  the  low-pass  filter  example  which  was  specified  in 
Sec.  IV-A.  For  this  filter,  we  have  M  =  256,  and  the  impulse  response  is 


31 

=  256  h  E  (-1)’ 
k=0 


k  2trkn 
cos 


Z5S-  +  °’7  cos 


-  0.225  cos  +  0.01995  cos  2ir^^n 


(V-22) 


We  found  that 


255 

Yj  I  h(n)  |  =  3. 12  .  (V-23) 

n=0 

If,  as  suggested  previously,  the  filter  is  implemented  using  512-point  FFT's,  then  a  scaling  by 
1/2  at  just  the  first  two  stages  of  the  inverse  transform  will  suffice  to  prevent  overflow.  If  the 
general  bound  were  used  instead,  scaling  at  the  first  five  arrays  would  be  applied.  Thus,  use 
of  the  specific  bound  rather  than  the  general  bound  permits  the  signal  level  to  be  kept  signifi¬ 
cantly  higher,  implying  a  corresponding  improvement  in  output  noise-to-signal  ratio. 
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C.  ROUNDOFF  NOISE  -  FIXED  POINT 


Using  the  results  of  our  roundoff  noise  analysis  for  fixed  point  FFT,  we  can  obtain  expres¬ 
sions  for  output  noise  variance  and  output  noise-to-signal  ratio,  for  a  fixed  point  FFT  filter. 

Let  us  examine  the  basic  steps  in  the  filtering  computation,  tracing  the  buildup  of  noise  variance 
as  we  proceed.  As  in  the  preceding  sections,  we  shall  focus  on  a  prototype  filter  with  256-point 
impulse  response.  We  assume  rounded  arithmetic. 

First,  we  use  the  FFT  to  compute  the  DFT  of  a  section  of  input.  In  the  implementation  of 
our  prototype  filter,  which  has  a  256-point  impulse  response,  we  might  compute  a  512-point 
FFT,  where  the  input  consists  of  256  input  samples  and  256  zeros.  Actually,  we  could  handle 
512  real  input  samples  at  once,  by  placing  sections  of  256  real  samples  in  both  the  real  and  im¬ 
aginary  parts  of  the  input  to  the  FFT.  To  guarantee  against  overflow,  we  need  a  scaling  of  1/2 
at  each  of  the  9  stages  in  the  FFT,  yielding  an  overall  attenuation  of  1/512.  Our  requirement 
on  the  input  sequence  is  |x(n)|  <  1.  The  DFT  we  compute  will  be  corrupted  by  noise,  which  we 
call  E^(k).  This  noise  has  variance  [see  (III— 2-4 )  and  (III— 26)] 

=  S|E1(k)  |2  =  (5/3)2'Zt  .  (V -24 ) 

This  noise  variance  is  small,  because  most  of  the  roundoff  noise  has  been  shifted  off  by  the  atten¬ 
uations.  However,  due  to  the  scalings,  the  mean-squared  signal  has  decreased  by  a  factor  of 
1/512,  so  that  the  noise-to-signal  ratio  has  increased  significantly  in  passing  through  the  FFT. 

Next,  we  multiply  this  computed  transform  by  a  sequence  H(k)  representing  the  DFT  of  a 
512-point  sequence  h(n),  where  h(n)  consists  of  the  filter  impulse  response  and  256  zeros.  This 
multiplication  introduces  new  roundoff  noise  of  variance  2  /3.  If  we  assume  |H(k)|  <  1,  the 

mean-square  of  the  noise  which  was  already  present  due  to  the  first  FFT  gets  reduced  by  a 
factor 

511 

£  |H(k)|Z  =  B  (V-25) 

k=0 

representing  a  ratio  of  the  filter  bandwidth  to  the  sampling  frequency.  Thus,  after  the  multipli¬ 
cation,  the  mean-square  of  the  total  noise  E.,(k)  is  approximately 

^  +  B(5/3)2_Zt  .  (V-26) 

The  noise  is  not  white,  but  contains  a  component  whose  spectrum  has  been  shaped  by  the  filter. 
For  our  low-pass  filter  example,  B  =>  1/ 4 ,  so  that 

aZ  «(3/4)2~Zt  .  (V-27 ) 

^2 

2  2 

Note  that  cr  is  slightly  less  than  a  ,  and  represents  only  about  a  bit  of  noise.  If  the  signal 
h.2  Ei 

spectrum  is  flat,  the  mean-squared  signal  will  also  be  reduced  somewhat  due  to  the  multiplica¬ 
tion  by  H(k). 


t  H(k)  is,  in  general,  complex,  sa  that  four  real  multiplies  are  needed.  If  certain  points  an  H(k)  are  real 
ar  zera,  less  naise  variance  will  be  introduced. 
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Now  we  compute  an  inverse  transform  to  obtain  a  section  of  output.  The  noise  variance  at 
the  output  of  this  transform  depends  on  how  many  scalings  are  necessary  in  the  inverse  FFT. 
Let  us  consider  our  low-pass  filter  example,  and  assume  that  the  specific  bound  of  Sec.  V-B-4 
is  used,  so  that  only  two  scalings  (at  the  first  two  arrays)  are  necessary  in  the  inverse  trans¬ 
form.  Then,  in  propagating  through  the  IFFT,  the  mean-square  of  the  noise  E.,(k)  increases 
by  a  factor  of  512/16.  The  512  represents  the  gain  of  the  inverse  DFT,  and  the  1/ 1 6  is  due  to 
scalings.  The  variance  of  the  additional  output  E^(k)  noise  caused  by  roundoff  in  the  IFFT  can 
be  estimated  easily  via  the  method  of  Sec.  III-B.  The  result  is 


512  ,  512 


2  ,512 


=  CTB'(~ir  +  (- +  ~T  +•••+!)  =  (202.3)2 


8 


B  '  8 


512 

4 


,-2t 


(V-28) 


The  total  mean-squared  output  noise  is 

=  (512/I6)a|  +  «(226)2'2t  .  (V-29) 

By  representing  this  result  in  units  of  bits  of  rms  output  noise. 


|  log  2  (22t<^)  =  3-91  bits  .  (V-30) 

We  can  estimate  the  mean-squared  output  signal,  if  we  assume  specific  statistics  for  the 
input  signal  x(n).  Let  us  assume  x(n)  is  white,  with  real  and  imaginary  parts  uniformly  distrib¬ 
uted  in  (—1,1).  Its  mean-square  is  then 

ff2=  S|x(n)|2  =  2/3  . 

This  variance  goes  through  an  attenuation  of  1/512  in  the  first  FFT  (due  to  the  scalings  at  each 
array),  an  attenuation  of  B  =  1/4  due  to  the  multiplication  by  H(k),  and  a  gain  of  512/16  in  the 
inverse  transform.  Thus,  the  mean-squared  output  signal  is 

a2  =  (1/512JB  (512/I6)a2  =  .  (V-31) 

The  output  noise-to-signal  ratio  is 

(72/ct2  =  (21696)2_2t  (V-32) 

or  in  units  of  bits. 


\  log2  [22V2/<72)]  =  7,20  bits  .  (V-33) 

Comparing  this  result  with  the  corresponding  results  (IV-22)  and  (IV-23)  for  the  frequency 
sampling  realization,  we  see  that,  for  this  example,  the  FFT  filter  provides  somewhat  better 
noise-to-signal  ratio.  The  advantage  is  3.79  bits  if  direct  form  resonators  are  used  in  the  fre¬ 
quency  sampling  filter,  but  only  0.95  bit  if  coupled  form  resonators  are  used. 

D.  COMPARISON  OF  FFT  AND  FREQUENCY  SAMPLING  FILTER  REALIZATION 

One  of  the  stated  objectives  of  the  last  two  sections  was  to  compare,  on  the  basis  of  differ¬ 
ing  quantization  effects,  FFT  and  frequency  sampling  realizations  of  the  same  digital  filter.  The 
techniques  for  constructing  such  a  comparison  have  been  demonstrated,  and  have  been  applied 
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to  a  specific  design  example  for  which  (we  considered  the  case  of  fixed  point  arithmetic)  the  FFT 
realization  yielded  a  slightly  lower  output  noise-to-signal  ratio.  The  effects  of  coefficient  quan¬ 
tization  were  qualitatively  somewhat  different  for  the  two  realizations.  For  equal  register  length 
coefficients,  we  might  judge  either  realization  to  yield  better  performance,  depending  on  our 
criterion.  Considering  both  coefficient  quantization  and  noise,  we  would  slightly  prefer  the  FFT 
realization. 

A  more  general  comparison  of  the  FFT  and  frequency  sampling  realizations,  on  the  basis 
of  differing  quantization  effects,  seems  to  us  to  be  both  difficult  and  unprofitable.  Two  of  the 
difficulties  are  that  first,  a  different  comparison  needs  to  be  constructed  for  each  specific  filter 
design,  and  second,  it  is  not  clear,  in  general,  what  the  criterion  for  comparison  should  be. 

This  more  general  comparison  would  also  be  somewhat  unprofitable,  since  there  are  many  fac¬ 
tors  other  than  quantization  effects  which  must  be  considered  in  choosing  between  alternate  filter 
designs.  For  example,  a  filter  (of  relatively  long  impulse  response)  in  which  all  or  most  of  the 
frequency  samples  H(k)  were  nonzero  would  almost  certainly  be  realized  by  the  FFT  technique, 
due  to  an  advantage  in  computational  speed  over  the  frequency  sampling  method.  The  FFT  real¬ 
ization  is  a  versatile  method  for  the  implementation  of  finite  duration  impulse  response  filters, 
while  the  frequency  sampling  method,  although  general,  is  practical  in  only  the  special  cases 
where  all  but  a  few  of  the  frequency  samples  are  zero.  For  example,  the  frequency  sampling 
technique  would  be  quite  appropriate  for  the  realization  of  a  bank  of  fairly  narrow  band-pass  fil¬ 
ters.  As  a  general  synthesis  procedure  for  finite  duration  impulse  response  filters,  we  would 
almost  certainly  prefer  the  FFT  method. 

Apart  from  setting  forth  a  comparison,  a  perhaps  more  important  objective  of  Secs.  IV  andV 
was  to  demonstrate  how  the  techniques  and  results  developed  in  studying  quantization  effects  in 
simple  recursive  filters  and  the  FFT,  could  be  applied  to  the  analysis  of  more  complicated  sys¬ 
tems.  A  point  we  wished  to  make  was  that,  given  a  good  understanding  of  quantization  effects 
in  the  basic  algorithms  (simple  recursive  filters  and  the  FFT),  the  analysis  of  more  complicated 
systems  can  be  accomplished  straightforwardly. 
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VI.  CONCLUSIONS  AND  PROBLEMS  FOR  FURTHER  STUDY 


In  Sec.  A  below,  we  now  summarize  the  report  and  review  the  important  results.  Then,  in 
Sec.  B,  we  mention  some  problems  which  have  not  been  resolved  here  and  which  seem  worthy 
of  further  study. 

A.  SUMMARY  AND  REVIEW  OF  RESULTS 

We  began  Sec.  I  by  outlining  the  problem  area  of  quantization  effects  in  digital  filters;  this 
outline  is  presented  in  brief  form  at  the  end  of  Sec.  I-A.  After  reviewing  previous  work  relevant 
to  the  problem  area  (Sec.  I-B),  we  devoted  Sec.  I-C  to  a  study  of  roundoff  error  models,  both 
fixed  and  floating  point.  The  basic  statistical  roundoff  error  models,  which  are  used  through¬ 
out  the  report,  were  set  down  and  a  few  experiments  testing  some  of  the  assumptions  of  the  mod¬ 
els  were  reported.  In  general,  the  experimental  results  were  in  accord  with  the  assumptions 
of  the  roundoff  error  models. 

Section  II  concentrated  on  quantization  effects  in  simple  (first-  and  second-order)  recursive 
filters.  The  analysis  of  coefficient  quantization  was  reviewed  for  the  fixed  point  case,  and  car¬ 
ried  over  to  the  floating  point  case.  We  found  that  when  the  range  of  coefficient  magnitudes  was 
not  too  large,  floating  point  provided  little  advantage  over  fixed  point  with  respect  to  the  coeffi¬ 
cient  quantization  problem.  Thus,  for  many  practical  filters,  we  might  well  consider  using  fixed 
point  coefficients  (thus  saving  the  bits  needed  for  the  coefficient  exponents)  even  if  floating  point 
arithmetic  is  to  be  employed. 

The  effects  of  roundoff  noise  and  the  overflow  constraint  were  considered  jointly  to  derive 
output  noise-to-signal  ratios  for  fixed,  floating,  and  block  floating  point  filter  realizations.  For 
the  floating  point  case,  experiments  verifying  the  theory  were  presented;  for  the  fixed  point  case, 
the  use  of  a  clamp  to  prevent  overflow  was  investigated. 

For  both  fixed  and  floating  point  arithmetic,  we  compared  (on  the  basis  of  differing  output 
noise-to-signal  ratios)  direct,  canonic,  and  coupled  form  digital  resonator  realizations.  We 
found  that  for  a  filter  with  low  resonant  frequency,  the  coupled  form  yields  a  lower  noise-to- 
signal  ratio  than  either  the  direct  or  canonic  forms,  which  yield  about  equal  noise-to-signal 
ratios. 

For  the  case  of  a  white  input  signal,  we  compared  noise-to-signal  ratios  for  fixed,  floating, 
and  block  floating  point  realizations  of  the  same  filter.  We  found  that,  as  the  gain  of  the  filter 
increases,  the  noise-to-signal  ratio  for  fixed  point  increases  at  a  greater  rate  than  for  floating 
or  block  floating  point.  This  comparison,  however,  is  sensitive  to  the  assumption  of  white  input 
signal,  since  the  fixed  point  noise-to-signal  ratio  depends  quite  strongly  on  the  type  of  input  sig¬ 
nal.  However,  the  methods  used  for  deriving  the  noise-to-signal  ratios  can  be  applied  for  any 
type  of  input  signal. 

In  Sec.  Ill,  we  studied  quantization  effects  for  the  other  basic  algorithm  of  digital  filtering, 
the  FFT.  The  results  are  summarized  in  detail  in  Sec.  III-E,  so  only  brief  comments  will  be 
given  here.  For  fixed  point,  we  found  that  the  effect  of  coefficient  quantization  was  to  produce 
a  mean-squared  output  error-to-signal  ratio  proportional  to  v=  log2N,  whereas  roundoff  noise 
caused  a  noise-to-signal  ratio  proportional  to  N.  Thus,  for  fixed  point,  it  would  be  reasonable 
to  maintain  fewer  bits  in  the  coefficient  accuracy  than  in  the  accuracy  of  the  arithmetic.  We 
found  that  for  signals  with  a  fairly  flat  spectrum,  a  block  floating  point  strategy  could  yield 
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significant  improvements  in  output  noise-to-signal  ratio,  over  a  fixed  point  strategy.  The  mean- 
squared  output  noise-to-signal  ratio  for  a  floating  point  FFT  of  a  white  signal  was  derived  (using 
a  statistical  roundoff  error  model),  and  found  to  be  proportional  to  u  -  log.,N.  This  result  dis¬ 
agreed  with  the  experimental  results  of  previous  workers,  who  measured  noise-to-signal  ratios 
2 

proportional  to  v  .  We  found,  however,  that  if  "randomized"  rounding  (see  See.  Ill- ID- 3 )  were 

used,  experimental  results  proportional  to  v  were  obtained.  With  " nonrandomized"  rounding, 

2 

or  truncation,  our  measured  noise-to-signal  ratios  seemed  to  be  proportional  to  u  ,  in  agree¬ 
ment  with  the  results  of  previous  workers. 

Section  IV  was  devoted  to  quantization  effects  in  frequency  sampling  filters,  and  most  of  our 
attention  was  restricted  to  the  fixed  point  case.  The  coefficient  accuracy  problem  was  studied 
in  some  detail  For  a  particular  filter  example,  we  measured  sidelobe  level  and  passband  rip¬ 
ple,  as  a  function  of  the  number  of  bits  used  for  the  coefficients.  We  showed  that  for  a  low-pass 
frequency  sampling  filter,  the  use  of  coupled  form  rather  than  direct  form  resonators  signifi¬ 
cantly  reduced  the  sensitivity  to  coefficient  quantization.  We  applied  the  results  of  See.  II  to  de¬ 
rive  output  noise-to-signal  ratio  for  a  fixed  point  frequency  sampling  filter.  We  found  that,  for 
a  low-pass  filter,  a  lower  output  noise-to-signal  ratio  was  obtained  with  coupled  form  resonators 
than  with  direct  form  resonators. 

In  Sec.  V,  we  analyzed  quantization  effects  in  FFT  filters.  The  coefficient  quantization  prob¬ 
lem  was  investigated  empirically.  For  the  same  low-pass  filter  example  studied  in  Sec.  IV,  we 
measured  sidelobe  level  and  passband  ripple  as  a  function  of  the  number  of  bits  retained  in  the 
eoeffieients.  Compared  with  the  frequency  sampling  realization,  the  FFT  filter  yields  lower 
passband  ripple  but  somewhat  higher  sidelobe  level,  for  equal  word  length  coefficients. 

One  of  the  important  questions  in  setting  up  a  fixed  point  FFT  filter  realization  is  how  to  ar¬ 
range  the  sealing  of  the  array  to  prevent  overflow,  yet  minimize  output  noise-to-signal  ratio. 

We  derived  (See.  V-B)  a  general  bound  on  the  output  of  a  circular  convolution,  which  aids  in  ar¬ 
ranging  the  scaling,  for  arbitrary  filter  characteristics.  We  also  gave  a  bound  which  is  appli¬ 
cable  for  specific  filter  characteristics,  and  is  usually  lower  than  the  general  bound.  Using  this 
specific  bound,  we  showed  how  scaling  would  be  arranged  in  the  low-pass  filter  example  men¬ 
tioned  above. 

With  this  sealing  scheme  set,  we  applied  the  techniques  of  Sec.  Ill  to  derive  output  noise- 
to-signal  ratio  for  a  fixed  point,  FFT  realization  of  our  low-pass  filter  example.  The  derived 
noise-to-signal  ratio  was  somewhat  lower  than  for  a  frequency  sampling  realization  of  the  same 
filter. 

B,  PROBLEMS  FOR  FURTHER  STUDY 

In  conclusion,  let  us  mention  some  problems  which  have  been  studied  here  slightly  or  not 
at  all,  and  which  we  feel  are  worthy  of  further  research. 

Most  of  our  work  dealt  with  situations  where  the  quantization  was  sufficiently  fine  so  that 
linear  noise  analysis  was  applicable.  It  would  be  interesting  to  study  systems  where  the  quan¬ 
tization  is  very  coarse.  Consider,  for  example,  a  second-order  recursive  filter  with  5-bit 
arithmetic.  Such  a  system  probably  cannot  be  satisfactorily  analyzed  as  a  linear  system  with 
additive  quantization  noise.  The  system  is  highly  nonlinear,  and  would  exhibit  such  effects  as 
complicated  limit  cycles.  We  would  probably  want  to  use  clamping  in  such  a  system,  and  this 
would  complicate  the  analysis  even  further.  However,  the  system  has  the  advantage  that  hard¬ 
ware  requirements  are  low,  and  computational  speed  could  be  very  high.  A  study  of  such  highly 
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quantized  systems,  finding  practical  applications  for  them  and  some  theory  to  describe  their 
operation,  would  be  of  much  interest.  This  type  of  study  need  not  be  restricted  to  recursive  fil¬ 
ters,  but  could  include  the  FFT  as  well.  For  example,  the  effects  of  very  coarse  coefficient 
quantization  in  an  FFT  might  be  investigated. 

Another  problem  area  is  the  question  of  quantization  effects  in  a  very  long  (say  10f  -point) 
or  multidimensional  FFT.  The  theory  we  developed  in  Sec.  Ill  can  be  used,  to  some  extent,  to 
analyze  this  problem.  But  to  know  for  certain  whether  our  results  can  be  extrapolated  straight¬ 
forwardly  for  very  long  transforms,  we  would  need  to  perform  experiments.  Just  programming 
a  10  -point  FFT  could  be  a  large  project,  since  most  computer  memories  could  not  accommo¬ 
date  all  the  data,  and  efficient  use  of  secondary  storage  (drums,  disks,  tapes)  would  have  to  be 
arranged. 

In  an  FFT  filter,  it  is  often  advantageous  to  do  simultaneous  filtering  of  two  real  signals, 
using  one  as  the  real  part  and  the  other  as  the  imaginary  part  of  the  input  to  the  filter.  Because 
of  quantization,  there  will  be  crosstalk  between  these  two  signals,  that  is,  each  of  the  two  output 
signals  will  be  affected  by  both  input  signals.  This  problem  deserves  some  investigation. 

Finally,  all  our  theoretical  quantization  noise  analysis  has  treated  the  case  of  rounding, 

rather  than  truncation.  Some  theoretical  work  for  the  case  of  truncation  would  be  useful.  For 

example,  could  it  be  explained  theoretically  why,  in  a  floating  point  FFT  with  truncation,  the 

2 

experimental  noise-to-s ignal  ratios  tend  to  be  proportional  to  (log^N)  rather  than  to  log£N  as 
for  rounding?  This  problem  involves  modifying  the  roundoff  error  model  to  account  for  the 
signal-noise  correlation  caused  by  truncation,  the  analysis  would  probably  be  fairly  tedious. 
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